Database annotation and retrieval

ABSTRACT

A data structure is provided for annotating data files within a database. The annotation data comprises a phoneme and word lattice which allows the quick and efficient searching of data files within the database, in response to a user's input query for desired information. The phoneme and word lattice comprises a plurality of time-ordered nodes, and a plurality of links extending between the nodes. Each link has a phoneme or word associated with it. The nodes are arranged in a sequence of time-ordered blocks such that further data can be conveniently added to the lattice.

This application is a National Stage Filing Under 35 U.S.C. 371 of International Application No. PCT/GB01/04331, filed Sep. 28, 2001, and published in English as International Publication No. WO 02/27546 A2, on Apr. 4, 2002.

The present invention relates to the annotation of data files which are to be stored in a database for facilitating their subsequent retrieval. The present invention is also concerned with a system for generating the annotation data which is added to the data file and to a system for searching the annotation data in the database to retrieve a desired data file in response to a user's input query. The invention also relates to a system for translating an unordered list of nodes and links into an ordered and blocked list of nodes and links.

Databases of information are well known and suffer from the problem of how to locate and retrieve the desired information from the database quickly and efficiently. Existing database search tools allow the user to search the database using typed keywords. Whilst this is quick and efficient, this type of searching is not suitable for various kinds of databases, such as video or audio databases.

According to one aspect, the present invention aims to provide a data structure for the annotation of data files within a database which will allow a quick and efficient search to be carried out in response to a user's input query.

According to another aspect, the present invention provides data defining a phoneme and word lattice for use as annotation data for annotating data files to be stored within a database. Preferably, the data defines a plurality of nodes and a plurality of links connecting the nodes, and further data associates a plurality of phonemes with a respective plurality of links and further data associates at least one word with at least one of said links, and further data defines a block arrangement for the nodes such that the links may only extend over a given maximum number of blocks. It is further preferred that the links may only extend into a following block.

According to another aspect, the present invention provides an apparatus for searching a database which employs the annotation data discussed above for annotating data files therein. Preferably, the apparatus is arranged to generate phoneme data in response to a user's query or input, and to search the database using the generated phoneme data. It is further preferred that word data is also generated from the user's input or query.

According to another aspect, the present invention provides an apparatus for generating a phoneme and word lattice corresponding to received phoneme and word data, comprising means for defining a plurality of links and a plurality of nodes between which the links extend, means for associating the links with phonemes or words, and means for arranging the nodes in a sequence of time ordered blocks in which the links only extend up to a maximum given number of blocks later in the sequence. Preferably, the maximum extension allowed for a link is to extend into a following block. It is further preferred that the apparatus is arranged to add nodes or links incrementally as it forms the lattice, and to split an existing block of nodes into at least two blocks of nodes.

According to another aspect, the present invention provides an apparatus for adding phonemes or words to a phoneme and word lattice of any of the types discussed above, and arranged to analyse which data defining the current phoneme and word lattice needs to be modified in dependence upon the extent to which the links are permitted to extend from one block to another. Preferably, this analysis is further dependent upon the location within the lattice of a point identifying the latest node in each block to which any link originating in the preceding block extends and a point identifying the earliest node in each block from which a link extends into the succeeding block.

According to another aspect, the present invention provides a method of adding phonemes or words to a phoneme and word lattice of any of the types discussed above, comprising analysing which data defining the current phoneme and word lattice needs to be modified in dependence upon the extent to which the links are permitted to extend from one block to another. Preferably, this analysis is further dependent upon the location within the lattice of respective points identifying the latest node in each block to which any link originating in the preceding block extends.

According to another aspect, a method and apparatus are provided for converting an unordered list of nodes and links into an ordered and blocked list of nodes and links. The blocks are formed by filling and splitting: successive nodes are inserted into a block until it is full, then a new block is begun. If new nodes would overfill an already full block, that block is split into two or more blocks. Constraints on the links regarding which block they can lead to are used to speed up the block splitting process, and identify which nodes remain in the old block and which go into the new block.

Exemplary embodiments of the present invention will now be described with reference to the accompanying figures, in which:

FIG. 1 is a schematic view of a computer which is programmed to operate an embodiment of the present invention;

FIG. 2 is a block diagram showing a phoneme and word annotator unit which is operable to generate phoneme and word annotation data for appendage to a data file;

FIG. 3 is a block diagram illustrating one way in which the phoneme and word annotator can generate the annotation data from an input video data file;

FIG. 4 a is a schematic diagram of a phoneme lattice for an example audio string from the input video data file;

FIG. 4 b is a schematic diagram of a word and phoneme lattice embodying one aspect of the present invention, for an example audio string from the input video data file;

FIG. 5 is a schematic block diagram of a user's terminal which allows the user to retrieve information from the database by a voice query;

FIG. 6 is a schematic diagram of a pair of word and phoneme lattices, for example audio strings from two speakers;

FIG. 7 is a schematic block diagram illustrating a user terminal which allows the annotation of a data file with annotation data generated from an audio signal input from a user;

FIG. 8 is a schematic diagram of phoneme and word lattice annotation data which is generated for an example utterance input by the user for annotating a data file;

FIG. 9 is a schematic block diagram illustrating a user terminal which allows the annotation of a data file with annotation data generated from a typed input from a user;

FIG. 10 is a schematic diagram of phoneme and word lattice annotation data which is generated for a typed input by the user for annotating a data file;

FIG. 11 is a block schematic diagram showing the form of a document annotation system;

FIG. 12 is a block schematic diagram of an alternative document annotation system;

FIG. 13 is a block schematic diagram of another document annotation system;

FIG. 14 is a schematic block diagram illustrating the way in which a phoneme and word lattice can be generated from script data contained within a video data file;

FIG. 15 a is a schematic diagram of a word and phoneme lattice showing relative timings of the nodes of the lattice;

FIG. 15 b is a schematic diagram showing the nodes of a word and phoneme lattice divided into blocks;

FIG. 16 a is a schematic diagram illustrating the format of data corresponding to one node of a word and phoneme lattice;

FIG. 16 b is a schematic diagram illustrating a data stream defining a word and phoneme lattice;

FIG. 17 is a flow diagram illustrating a process of forming a word and phoneme lattice according to one embodiment of the present invention;

FIGS. 18 a to 18 h are schematic diagrams illustrating the build-up of a word and phoneme lattice;

FIGS. 19 a to 19 h are schematic diagrams illustrating the build-up of a data stream defining a word and phoneme lattice;

FIGS. 20 a to 20 c are schematic diagrams showing the updating of a word and phoneme lattice on insertion of a long link;

FIGS. 21 a to 21 d are schematic diagrams illustrating the updating of a word and phoneme lattice on insertion of additional nodes;

FIG. 22 is a flow diagram illustrating a procedure of adjusting off-sets;

FIGS. 23 a and 23 b are schematic diagrams illustrating the application of a block splitting procedure to a word and phoneme lattice; and

FIG. 24 is a block diagram illustrating one way in which the phoneme and word annotator can generate the annotation data from an input video data file.

Embodiments of the present invention can be implemented using dedicated hardware circuits, but the embodiment to be described is implemented in computer software or code, which is run in conjunction with processing hardware such as a personal computer, work station, photocopier, facsimile machine, personal digital assistant (PDA) or the like.

FIG. 1 shows a personal computer (PC) 1 which is programmed to operate an embodiment of the present invention. A keyboard 3, a pointing device 5, a microphone 7 and a telephone line 9 are connected to the PC 1 via an interface 11. The keyboard 3 and pointing device 5 enable the system to be controlled by a user. The microphone 7 converts acoustic speech signals from the user into equivalent electrical signals and supplies them to the PC 1 for processing. An internal modem and speech receiving circuit (not shown) is connected to the telephone line 9 so that the PC 1 can communicate with, for example, a remote computer or with a remote user.

The programme instructions which make the PC 1 operate in accordance with the present invention may be supplied for use with an existing PC 1 on, for example, a storage device such as a magnetic disc 13, or by downloading the software from the Internet (not shown) via the internal modem and telephone line 9.

Data File Annotation

FIG. 2 is a block diagram illustrating the way in which annotation data 21 for an input data file 23 is generated in this embodiment by a phoneme and word annotating unit 25. As shown, the generated phoneme and word annotation data 21 is then combined with the data file 23 in the data combination unit 27 and the combined data file output thereby is input to the database 29. In this embodiment, the annotation data 21 comprises a combined phoneme (or phoneme like) and word lattice which allows the user to retrieve information from the database by a voice query. As those skilled in the art will appreciate, the data file 23 can be any kind of data file, such as a video file, an audio file, a multimedia file etc.

A system has been proposed to generate N-Best word lists for an audio stream as annotation data by passing the audio data from a video data file through an automatic speech recognition unit. However, such word-based systems suffer from a number of problems. These include (i) that state of the art speech recognition systems still make basic mistakes in recognition; (ii) that state of the art automatic speech recognition systems use a dictionary of perhaps 20,000 to 100,000 words and cannot produce words outside that vocabulary; and (iii) that the production of N-Best lists grows exponentially with the number of hypotheses at each stage, therefore resulting in the annotation data becoming prohibitively large for long utterances.

The first of these problems may not be that significant if the same automatic speech recognition system is used to generate the annotation data and to subsequently retrieve the corresponding data file, since the same decoding error could occur. However, with advances in automatic speech recognition systems being made each year, it is likely that in the future the same type of error may not occur, resulting in an inability to retrieve the corresponding data file at that later date. With regard to the second problem, this is particularly significant in video data applications, since users are likely to use names and places (which may not be in the speech recognition dictionary) as input query terms. In place of these names, the automatic speech recognition system will typically replace the out of vocabulary words with a phonetically similar word or words within the vocabulary, often corrupting nearby decodings. This can also result in the failure to retrieve the required data file upon subsequent request.

In contrast, with the proposed phoneme and word lattice annotation data, a quick and efficient search using the word data in the database 29 can be carried out and, if this fails to provide the required data file, then a further search using the more robust phoneme data can be performed. The phoneme and word lattice is an acyclic directed graph with a single entry point and a single exit point. It represents different parses of the audio stream within the data file. It is not simply a sequence of words with alternatives since each word does not have to be replaced by a single alternative, one word can be substituted for two or more words or phonemes, and the whole structure can form a substitution for one or more words or phonemes. Therefore, the density of data within the phoneme and word lattice essentially remains linear throughout the audio data, rather than growing exponentially as in the case of the N-Best technique discussed above. As those skilled in the art of speech recognition will realise, the use of phoneme data is more robust, because phonemes are dictionary independent and allow the system to cope with out of vocabulary words, such as names, places, foreign words etc. The use of phoneme data is also capable of making the system future proof, since it allows data files which are placed into the database to be retrieved even when the words were not understood by the original automatic speech recognition system.
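The difference in growth between a lattice and an explicit list of alternatives can be illustrated with a minimal Python sketch. The link representation (a list of `(from_node, to_node, label)` tuples), the function name `count_paths` and the example lattice are assumptions made purely for illustration and are not taken from the embodiment. The sketch counts how many distinct parses a small lattice encodes and compares that with the number of links that must be stored.

```python
from collections import defaultdict

def count_paths(links, start, end):
    """Count the distinct start-to-end paths through an acyclic lattice.

    links: list of (from_node, to_node, label) tuples. Nodes are assumed to
    be integers numbered in time order, so every link has from_node < to_node.
    """
    outgoing = defaultdict(list)
    for src, dst, _label in links:
        outgoing[src].append(dst)
    paths = defaultdict(int)
    paths[start] = 1
    # Visiting nodes in time order guarantees paths[node] is final when used.
    for node in range(start, end):
        for dst in outgoing[node]:
            paths[dst] += paths[node]
    return paths[end]

# Hypothetical lattice: two alternative phonemes between each pair of
# consecutive nodes, over ten steps.
links = []
for n in range(10):
    links.append((n, n + 1, "a"))
    links.append((n, n + 1, "b"))

print(len(links))                  # links stored grow linearly with length
print(count_paths(links, 0, 10))   # parses represented grow exponentially
```

In this hypothetical case twenty stored links represent 1024 alternative parses. A word link that spans several phoneme links, such as `(0, 3, "W")`, adds further parses at the cost of one extra stored link.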

The way in which this phoneme and word lattice annotation data can be generated for a video data file will now be described with reference to FIG. 3. As shown, the video data file 31 comprises video data 31-1, which defines the sequence of images forming the video sequence and audio data 31-2, which defines the audio which is associated with the video sequence. As is well known, the audio data 31-2 is time synchronised with the video data 31-1 so that, in use, both the video and audio data are supplied to the user at the same time.

As shown in FIG. 3, in this embodiment, the audio data 31-2 is input to an automatic speech recognition unit 33, which is operable to generate a phoneme lattice corresponding to the stream of audio data 31-2. Such an automatic speech recognition unit 33 is commonly available in the art and will not be described in further detail. The reader is referred to, for example, the book entitled ‘Fundamentals of Speech Recognition’ by Lawrence Rabiner and Biing-Hwang Juang and, in particular, to pages 42 to 50 thereof, for further information on this type of speech recognition system.

FIG. 4 a illustrates the form of the phoneme lattice data output by the speech recognition unit 33, for the input audio corresponding to the phrase '. . . now is the winter of our . . . '. The automatic speech recognition unit 33 identifies a number of different possible phoneme strings which correspond to this input audio utterance. For example, the speech recognition system considers that the first phoneme in the audio string is either an /m/ or an /n/. For clarity, only the alternatives for the first phoneme are shown. As is well known in the art of speech recognition, these different possibilities can have their own weighting which is generated by the speech recognition unit 33 and is indicative of the confidence of the speech recognition unit's output. For example, the phoneme /n/ may be given a weighting of 0.9 and the phoneme /m/ may be given a weighting of 0.1, indicating that the speech recognition system is fairly confident that the corresponding portion of audio represents the phoneme /n/, but that it still may be the phoneme /m/.

In this embodiment, however, this weighting of the phonemes is not performed.

As shown in FIG. 3, the phoneme lattice data 35 output by the automatic speech recognition unit 33 is input to a word decoder 37 which is operable to identify possible words within the phoneme lattice data 35. In this embodiment, the words identified by the word decoder 37 are incorporated into the phoneme lattice data structure. For example, for the phoneme lattice shown in FIG. 4 a, the word decoder 37 identifies the words “NOW”, “IS”, “THE”, “WINTER”, “OF” and “OUR”. As shown in FIG. 4 b, these identified words are added to the phoneme lattice data structure output by the speech recognition unit 33, to generate a phoneme and word lattice data structure which forms the annotation data 31-3. This annotation data 31-3 is then combined with the video data file 31 to generate an augmented video data file 31′ which is then stored in the database 29. As those skilled in the art will appreciate, in a similar way to the way in which the audio data 31-2 is time synchronised with the video data 31-1, the annotation data 31-3 is also time synchronised and associated with the corresponding video data 31-1 and audio data 31-2, so that a desired portion of the video and audio data can be retrieved by searching for and locating the corresponding portion of the annotation data 31-3.

In this embodiment, the annotation data 31-3 stored in the database 29 has the following general form:

-   Header
    -   time of start
    -   flag identifying whether the data is word, phoneme or mixed
    -   time index associating the location of blocks of annotation data within memory to a given time point
    -   word set used (i.e. the dictionary)
    -   phoneme set used
    -   phoneme probability data
    -   the language to which the vocabulary pertains
-   Block(i) i=0,1,2, . . .
    -   node N_(j) j=0,1,2, . . .
        -   time offset of node from start of block
        -   phoneme links (k) k=0,1,2 . . .
            -   offset to node N_(j)=N_(k)−N_(j) (N_(k) is node to which link k extends) or, if N_(k) is in block (i+1), offset to node N_(j)=N_(k)+N_(b)−N_(j) (where N_(b) is the number of nodes in block (i))
            -   phoneme associated with link (k)
        -   word links (l) l=0,1,2 . . .
            -   offset to node N_(j)=N_(k)−N_(j) (N_(k) is node to which link l extends) or, if N_(k) is in block (i+1), offset to node N_(j)=N_(k)+N_(b)−N_(j) (where N_(b) is the number of nodes in block (i))
            -   word associated with link (l)

The time of start data in the header can identify the time and date of transmission of the data. For example, if the video file is a news broadcast, then the time of start may include the exact time of the broadcast and the date on which it was broadcast.

The flag identifying if the annotation data is word annotation data, phoneme annotation data or if it is mixed is provided since not all the data files within the database will include the combined phoneme and word lattice annotation data discussed above, and in this case, a different search strategy would be used to search this annotation data.

In this embodiment, the annotation data is divided into blocks in order to allow the search to jump into the middle of the annotation data for a given audio data stream. The header therefore includes a time index which associates the location of the blocks of annotation data within the memory to a given time offset between the time of start and the time corresponding to the beginning of the block.
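By way of illustration only, the use of such a time index to jump into the annotation data might be sketched as follows in Python. The representation of the index as a sorted list of `(time offset, memory location)` pairs, the numeric values and the function name `find_block` are assumptions made for this sketch; the embodiment does not prescribe a particular encoding.

```python
import bisect

def find_block(time_index, t):
    """Return the index of the block whose start is the latest one at or before t.

    time_index: list of (start_offset_seconds, memory_location) pairs, sorted
    by start offset. Both fields are hypothetical stand-ins for the header data.
    """
    starts = [start for start, _location in time_index]
    # bisect_right gives the number of blocks starting at or before t.
    i = bisect.bisect_right(starts, t) - 1
    return max(i, 0)

# Hypothetical index for three blocks of annotation data.
time_index = [(0.0, 128), (0.71, 560), (1.55, 1024)]

block = find_block(time_index, 0.9)
print(block, time_index[block][1])   # block number and where it is stored
```

A binary search of this kind lets the search begin at the block covering a requested time without reading the preceding blocks.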

The header also includes data defining the word set used (i.e. the dictionary), the phoneme set used and the language to which the vocabulary pertains. The header may also include details of the automatic speech recognition system used to generate the annotation data and any appropriate settings thereof which were used during the generation of the annotation data.

The phoneme probability data defines the probability of insertions, deletions, misrecognitions and decodings for the system, such as an automatic speech recognition system, which generated the annotation data.

The blocks of annotation data then follow the header and identify, for each node in the block, the time offset of the node from the start of the block, the phoneme links which connect that node to other nodes by phonemes and word links which connect that node to other nodes by words. Each phoneme link and word link identifies the phoneme or word which is associated with the link. They also identify the offset to the current node. For example, if node N₅₀ is linked to node N₅₅ by a phoneme link, then the offset to node N₅₀ is 5. As those skilled in the art will appreciate, using an offset indication like this allows the division of the continuous annotation data into separate blocks.
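The offset convention can be made concrete with a short Python sketch. The class names `Link` and `Node`, the helper names `link_offset` and `resolve`, and the representation of a block as a plain list of nodes are all assumptions introduced for illustration; only the two offset formulae follow the general form of the annotation data given above.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    offset: int            # how many nodes ahead of the current node the link ends
    label: str             # the phoneme or word associated with the link
    is_word: bool = False

@dataclass
class Node:
    time_offset: float     # seconds from the start of the containing block
    links: list = field(default_factory=list)

def link_offset(j, k, nodes_in_block, target_in_next_block):
    """Offset stored on a link from node j of block i to node k.

    Node numbers are counted within each block, so a target in block i+1
    needs the size of block i (N_b) added.
    """
    if target_in_next_block:
        return k + nodes_in_block - j
    return k - j

def resolve(blocks, block_idx, node_idx, link):
    """Return (block index, node index) of the node that the link leads to."""
    target = node_idx + link.offset
    while target >= len(blocks[block_idx]):
        target -= len(blocks[block_idx])
        block_idx += 1
    return block_idx, target

# The example from the text: a link from node 50 to node 55 in the same block.
print(link_offset(50, 55, 100, False))

# Hypothetical two-block lattice: node 3 of block 0 links to node 1 of block 1.
blocks = [[Node(0.1 * n) for n in range(5)], [Node(0.1 * n) for n in range(5)]]
offset = link_offset(3, 1, len(blocks[0]), True)
print(offset, resolve(blocks, 0, 3, Link(offset, "ih")))
```

Because each link stores only a relative count of nodes, a block can be read or relocated without rewriting absolute node numbers elsewhere in the data. The sketch does no bounds checking for links that run past the final block.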

In an embodiment where an automatic speech recognition unit outputs weightings indicative of the confidence of the speech recognition unit's output, these weightings or confidence scores would also be included within the data structure. In particular, a confidence score would be provided for each node which is indicative of the confidence of arriving at the node and each of the phoneme and word links would include a transition score depending upon the weighting given to the corresponding phoneme or word. These weightings would then be used to control the search and retrieval of the data files by discarding those matches which have a low confidence score.
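One possible way of discarding low-confidence matches is sketched below in Python. The text does not specify how transition scores are combined, so the sketch assumes, purely for illustration, that the scores along a matched path behave like probabilities and are multiplied. The function name `filter_matches`, the threshold and the example data are likewise hypothetical.

```python
from math import prod

def filter_matches(matches, threshold):
    """Keep matches whose combined confidence reaches the threshold.

    matches: list of (file_id, transition_scores) pairs, where
    transition_scores lists the scores of the links on the matched path.
    Assumption: scores are combined by multiplication.
    """
    kept = []
    for file_id, scores in matches:
        confidence = prod(scores)
        if confidence >= threshold:
            kept.append((file_id, confidence))
    # Most confident matches first.
    return sorted(kept, key=lambda match: match[1], reverse=True)

# Hypothetical search results for three data files.
matches = [("a", [0.9, 0.9]), ("b", [0.9, 0.1]), ("c", [1.0])]
print(filter_matches(matches, 0.5))
```

With the weightings of 0.9 and 0.1 used as examples earlier, a path through the less likely phoneme falls below the threshold and is discarded.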

Data File Retrieval

FIG. 5 is a block diagram illustrating the form of a user terminal 59 which can be used to retrieve the annotated data files from the database 29. This user terminal 59 may be, for example, a personal computer, hand held device or the like. As shown, in this embodiment, the user terminal 59 comprises the database 29 of annotated data files, an automatic speech recognition unit 51, a search engine 53, a control unit 55 and a display 57. In operation, the automatic speech recognition unit 51 is operable to process an input voice query from the user 39 received via the microphone 7 and the input line 61 and to generate therefrom corresponding phoneme and word data. This data may also take the form of a phoneme and word lattice, but this is not essential. This phoneme and word data is then input to the control unit 55 which is operable to initiate an appropriate search of the database 29 using the search engine 53. The results of the search, generated by the search engine 53, are then transmitted back to the control unit 55 which analyses the search results and generates and displays appropriate display data to the user via the display 57. More details of the search techniques which can be performed are given in co-pending applications PCT/GB00/00718 and GB9925561.4, the contents of which are incorporated herein by reference.

ALTERNATIVE EMBODIMENTS

As those skilled in the art will appreciate, this type of phonetic and word annotation of data files in a database provides a convenient and powerful way to allow a user to search the database by voice. In the illustrated embodiment, a single audio data stream was annotated and stored in the database for subsequent retrieval by the user. As those skilled in the art will appreciate, when the input data file corresponds to a video data file, the audio data within the data file will usually include audio data for different speakers. Instead of generating a single stream of annotation data for the audio data, separate phoneme and word lattice annotation data can be generated for the audio data of each speaker. This may be achieved by identifying, from the pitch or from another distinguishing feature of the speech signals, the audio data which corresponds to each of the speakers and then by annotating the different speakers' audio separately. This may also be achieved if the audio data was recorded in stereo or if an array of microphones was used in generating the audio data, since it is then possible to process the audio data to extract the data for each speaker.

FIG. 6 illustrates the form of the annotation data in such an embodiment, where a first speaker utters the words “. . . this so” and the second speaker replies “yes”. As illustrated, the annotation data for the different speakers' audio data are time synchronised, relative to each other, so that the annotation data is still time synchronised to the video and audio data within the data file. In such an embodiment, the header information in the data structure should preferably include a list of the different speakers within the annotation data and, for each speaker, data defining that speaker's language, accent, dialect and phonetic set, and each block should identify those speakers that are active in the block.

In the above embodiments, a speech recognition system was used to generate the annotation data for annotating a data file in the database. As those skilled in the art will appreciate, other techniques can be used to generate this annotation data. For example, a human operator can listen to the audio data and generate a phonetic and word transcription to thereby manually generate the annotation data.

In the above embodiments, the annotation data was generated from audio stored in the data file itself. As those skilled in the art will appreciate, other techniques can be used to input the annotation data. FIG. 7 illustrates the form of a user terminal 59 which allows a user to input voice annotation data via the microphone 7 for annotating a data file 91 which is to be stored in the database 29. In this embodiment, the data file 91 comprises a two dimensional image generated by, for example, a camera. The user terminal 59 allows the user 39 to annotate the 2D image with an appropriate annotation which can be used subsequently for retrieving the 2D image from the database 29. In this embodiment, the input voice annotation signal is converted, by the automatic speech recognition unit 51, into phoneme and word lattice annotation data which is passed to the control unit 55. In response to the user's input, the control unit 55 retrieves the appropriate 2D file from the database 29 and appends the phoneme and word annotation data to the data file 91. The augmented data file is then returned to the database 29. During this annotating step, the control unit 55 is operable to display the 2D image on the display 57 so that the user can ensure that the annotation data is associated with the correct data file 91.

The automatic speech recognition unit 51 generates the phoneme and word lattice annotation data by (i) generating a phoneme lattice for the input utterance; (ii) then identifying words within the phoneme lattice; and (iii) finally by combining the two. FIG. 8 illustrates the form of the phoneme and word lattice annotation data generated for the input utterance “picture of the Taj-Mahal”. As shown, the automatic speech recognition unit identifies a number of different possible phoneme strings which correspond to this input utterance. As shown in FIG. 8, the words which the automatic speech recognition unit 51 identifies within the phoneme lattice are incorporated into the phoneme lattice data structure. As shown, for the example phrase, the automatic speech recognition unit 51 identifies the words “picture”, “of”, “off”, “the”, “other”, “ta”, “tar”, “jam”, “ah”, “hal”, “ha” and “al”. The control unit 55 is then operable to add this annotation data to the 2D image data file 91 which is then stored in a database 29.

As those skilled in the art will appreciate, this embodiment can be used to annotate any kind of image such as x-rays of patients, 3D videos of, for example, NMR scans, ultrasound scans etc. It can also be used to annotate one-dimensional data, such as audio data or seismic data.

In the above embodiment, a data file was annotated from a voiced annotation. As those skilled in the art will appreciate, other techniques can be used to input the annotation. For example, FIG. 9 illustrates the form of a user terminal 59 which allows a user to input typed annotation data via the keyboard 3 for annotating a data file 91 which is to be stored in a database 29. In this embodiment, the typed input is converted, by the phonetic transcription unit 75, into the phoneme and word lattice annotation data (using an internal phonetic dictionary (not shown)) which is passed to the control unit 55. In response to the user's input, the control unit 55 retrieves the appropriate 2D file from the database 29 and appends the phoneme and word annotation data to the data file 91. The augmented data file is then returned to the database 29. During this annotating step, the control unit 55 is operable to display the 2D image on the display 57 so that the user can ensure that the annotation data is associated with the correct data file 91.

FIG. 10 illustrates the form of the phoneme and word lattice annotation data generated for the typed input “picture of the Taj-Mahal”. As shown in FIG. 2, the phoneme and word lattice is an acyclic directed graph with a single entry point and a single exit point. It represents different parses of the user's input. As shown, the phonetic transcription unit 75 identifies a number of different possible phoneme strings which correspond to the typed input.

FIG. 11 is a block diagram illustrating a document annotation system. In particular, as shown in FIG. 11, a text document 101 is converted into an image data file by a document scanner 103. The image data file is then passed to an optical character recognition (OCR) unit 105 which converts the image data of the document 101 into electronic text. This electronic text is then supplied to a phonetic transcription unit 107 which is operable to generate phoneme and word annotation data 109 which is then appended to the image data output by the scanner 103 to form a data file 111. As shown, the data file 111 is then stored in the database 29 for subsequent retrieval. In this embodiment, the annotation data 109 comprises the combined phoneme and word lattice described above which allows the user to subsequently retrieve the data file 111 from the database 29 by a voice query.

FIG. 12 illustrates a modification to the document annotation system shown in FIG. 11. The difference between the system shown in FIG. 12 and the system shown in FIG. 11 is that the output of the optical character recognition unit 105 is used to generate the data file 113, rather than the image data output by the scanner 103. The rest of the system shown in FIG. 12 is the same as that shown in FIG. 11 and will not be described further.

FIG. 13 shows a further modification to the document annotation system shown in FIG. 11. In the embodiment shown in FIG. 13, the input document is received by a facsimile unit 115 rather than a scanner 103. The image data output by the facsimile unit is then processed in the same manner as the image data output by the scanner 103 shown in FIG. 11, and will not be described again.

In the above embodiment, a phonetic transcription unit 107 was used for generating the annotation data for annotating the image or text data. As those skilled in the art will appreciate, other techniques can be used. For example, a human operator can manually generate this annotation data from the image of the document itself.

In the first embodiment, the audio data from the data file 31 was passed through an automatic speech recognition unit in order to generate the phoneme annotation data. In some situations, a transcript of the audio data will be present in the data file. Such an embodiment is illustrated in FIG. 14. In this embodiment, the data file 81 represents a digital video file having video data 81-1, audio data 81-2 and script data 81-3 which defines the lines for the various actors in the video film. As shown, the script data 81-3 is passed through a text to phoneme converter 83, which generates phoneme lattice data 85 using a stored dictionary which translates words into possible sequences of phonemes. This phoneme lattice data 85 is then combined with the script data 81-3 to generate the above described phoneme and word lattice annotation data 81-4. This annotation data is then added to the data file 81 to generate an augmented data file 81′ which is then added to the database 29. As those skilled in the art will appreciate, this embodiment facilitates the generation of separate phoneme and word lattice annotation data for the different speakers within the video data file, since the script data usually contains indications of who is talking. The synchronisation of the phoneme and word lattice annotation data with the video and audio data can then be achieved by performing a forced time alignment of the script data with the audio data using an automatic speech recognition system (not shown).

In the above embodiments, a phoneme (or phoneme-like) and word lattice was used to annotate a data file. As those skilled in the art of speech recognition and speech processing will realise, the word “phoneme” in the description and claims is not limited to its linguistic meaning but includes the various sub-word units that are identified and used in standard speech recognition systems, such as phonemes, syllables, Katakana (Japanese alphabet) etc.

Lattice Generation

In the above description, generation of the phoneme and word lattice data structure shown in FIG. 4 b was described with reference to FIG. 3. A preferred form of that data structure, including a preferred division of the nodes into blocks, will now be described with reference to FIGS. 15 to 17. Thereafter, one way of generating the preferred data structure will be described with reference to FIGS. 18 to 22.

FIG. 15 a shows the timing of each node of the lattice relative to a common zero time, which in the present example is set such that the first node occurs at a time of 0.10 seconds. It is noted that FIG. 15 a is merely schematic and as such the time axis is not represented linearly.

In the present embodiment, the nodes are divided into three blocks as shown in FIG. 15 b. In the present embodiment, demarcation of the nodes into blocks is implemented by block markers or flags 202, 204, 206 and 208. Block markers 204, 206 and 208 are located immediately after the last node of a block, but are shown slightly spaced therefrom in FIG. 15 b for the sake of clarity of the illustration. Block marker 204 marks the end of block 0 and the start of block 1; similarly, block marker 206 marks the end of block 1 and the start of block 2. Block marker 208 is at the end of the lattice and hence only indicates the end of block 2. Block marker 202 is implemented at time t=0.00 seconds in order to provide the demarcation of the start of block 0. In the present embodiment, block 0 has five nodes, block 1 also has five nodes and block 2 has seven nodes.

The time of each node is provided relative to the time of the start of its respective block. This does not affect the timings of the nodes in block 0. However, for the further blocks the new off-set timings are different from each node's absolute timing as per FIG. 15 a. In the present embodiment the start time for each of the blocks other than block 0 is taken to be the time of the last node of the preceding block. For example, in FIG. 15 a it can be seen that the node between the phonemes /ih/ and /z/ occurs at 0.71 seconds, and is the last node of block 0. From FIG. 15 a it can be seen that the next node, i.e. that between the phoneme /z/ and the phoneme /dh/, occurs at a time of 0.94 seconds, which is 0.23 seconds after the time of 0.71 seconds. Consequently, as can be seen in FIG. 15 b, the off-set time of the first node of block 1 is 0.23 seconds.

The use of time off-sets determined relative to the start of each block rather than from the start of the whole lattice provides advantages with respect to dynamic range as follows. As the total time of a lattice increases, the dynamic range of the data type used to record the timing values in the lattice structure will need to increase accordingly, which will consume large amounts of memory. This will become exacerbated when the lattice structure is being provided for a data file of unknown length, for example if a common lattice structure is desired to be usable for annotating either a one minute television commercial or a film or television programme lasting a number of hours. In contrast, the dynamic range of the corresponding data type for the lattice structure divided into blocks is significantly reduced by only needing to accommodate a maximum expected time off-set of a single block, and moreover this remains the same irrespective of the total duration of the data file. In the present embodiment the data type employed provides integer values where each value of the integer represents the off-set time measured in hundredths of a second.
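The conversion from absolute node times to block-relative off-sets can be illustrated with a minimal sketch. The function name `to_block_offsets`, the reduced block sizes and the final time value of 120 are assumptions made purely for illustration; only the values 10, 41, 71 and 94 (hundredths of a second) are taken from the description above.

```python
def to_block_offsets(abs_times_cs, block_sizes):
    """Convert absolute node times (hundredths of a second) into
    per-block off-sets. Each block other than block 0 starts at the
    time of the last node of the preceding block. Every block is
    assumed to contain at least one node."""
    offsets = []
    block_start = 0  # block 0 starts at t = 0.00 s
    i = 0
    for size in block_sizes:
        block = abs_times_cs[i:i + size]
        offsets.append([t - block_start for t in block])
        block_start = block[-1]  # start time of the next block
        i += size
    return offsets


# Reduced, partly hypothetical fragment: three nodes, then two nodes.
abs_times = [10, 41, 71, 94, 120]
print(to_block_offsets(abs_times, [3, 2]))
```

In this fragment the node at 0.94 seconds acquires an off-set of 23, matching the 0.23 seconds stated for the first node of block 1, and no stored value ever needs to exceed the duration of a single block.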

FIG. 15 b also shows certain parts of the lattice structure identified as alpha (α) and beta (β). The significance of these items will be explained later.

The format in which the data is held for each respective node in the preferred form of the phoneme and word lattice data structure will now be explained with reference to FIG. 16 a, which shows by way of example the format of the data for the first node of the lattice. The data for this particular node is in the form of seven data components 210, 212, 214, 216, 218, 220 and 222.

The first data component 210 specifies the time off-set of the node from the start of the block. In the present example, the value is 0.10 seconds, and is implemented by means of the integer data type described earlier above.

The second data component 212 represents the word link “NOW”, which is shown in FIGS. 15 a and 15 b extending from the first node. The third data component 214 specifies the nodal off-set of the preceding link, i.e. the word link “NOW”, by which is meant the number of nodes the preceding link extends by. Referring to FIGS. 15 a and 15 b, it can be seen that the node to which the word link “NOW” extends is the third node along from the node from which the link extends, hence the nodal off-set is 3, as represented illustratively in FIG. 16 a by the value 003. In the present embodiment the data type employed to implement the nodal off-set values is again one providing integer values.

The fourth data component 216 represents the phoneme /n/ which extends from the first node to the second node, entailing therefore a nodal off-set of one which leads directly to the value 001 for the fifth data component 218 as shown in FIG. 16 a. Similarly the sixth data component 220 represents the phoneme link /m/, and the seventh data component 222 shows the nodal off-set of that link which is equal to 1 and represented as 001.

The manner in which the data components 212, 216 and 220 represent the respective word or phoneme associated with their link can be implemented in any appropriate manner. In the present embodiment the data components 212, 216 and 220 consist of an integer value which corresponds to a word index entry value (in the case of a word link) or a phoneme index entry value (in the case of a phoneme link). The index entry value serves to identify an entry in a corresponding word or phoneme index containing a list of words or phonemes as appropriate. In the present embodiment the corresponding word or phoneme index is held in the header part of the annotation data 31-3 described earlier. In other embodiments the header may itself only contain a further cross-reference identification to a separate database storing one or more word or phoneme indices.

Generally, the different links corresponding to a given node can be placed in the data format of FIG. 16 a in any desired relative order. In the present embodiment, however, a preferred order is employed in which the word or phoneme link with the largest nodal off-set, i.e. the “longest” link, is placed first in the sequence. Thus, in the present case, the “longest” link is the word link “NOW” with a nodal off-set of three nodes, and it is therefore placed before the “shorter” phoneme links /n/ and /m/ which each only have a nodal off-set of 1. Advantages of this preferred arrangement will be explained later below.
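The seven-component format of the first node, with the longest link placed first, might be modelled along the following lines. This is only a sketch: the class names `Link` and `Node` and the methods `add_link` and `components` are hypothetical, and readable labels are stored in place of the integer index entry values that the present embodiment actually uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    label: str          # stands in for a word or phoneme index entry value
    nodal_offset: int   # number of nodes the link extends by


@dataclass
class Node:
    time_offset_cs: int                      # off-set from block start, 1/100 s
    links: List[Link] = field(default_factory=list)

    def add_link(self, label, nodal_offset):
        """Add a link and keep the longest link first in the sequence."""
        self.links.append(Link(label, nodal_offset))
        # list.sort is stable, so links of equal length keep insertion order.
        self.links.sort(key=lambda l: l.nodal_offset, reverse=True)

    def components(self):
        """Flatten to: time, then (label, nodal off-set) per link."""
        out = [self.time_offset_cs]
        for link in self.links:
            out += [link.label, link.nodal_offset]
        return out


first = Node(10)
first.add_link("n", 1)
first.add_link("m", 1)
first.add_link("NOW", 3)   # added last, but sorted to the front
print(first.components())
```

The flattened result has seven entries, corresponding in order to the data components 210 to 222.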

The data for each node, in the form shown in FIG. 16 a, is arranged in a time ordered sequence to form a data stream defining the whole lattice (except for the header). The data stream for the lattice shown in FIG. 15 b is shown in FIG. 16 b. As shown, the data stream additionally includes data components 225 to 241 serving as node flags to identify that the data components following them refer to the next respective node. The data stream also includes further data components 244, 246, 248 and 250 implementing respectively the block markers 202, 204, 206 and 208 described earlier above with respect to FIG. 15 b.
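A data stream of this general shape, with node flags and block markers interleaved with the per-node data, can be sketched as follows. The flag values `BLOCK_FLAG` and `NODE_FLAG`, the function name `serialize` and the time of 25 for the second node are assumptions for illustration; the actual encoding of the flags is not specified here.

```python
BLOCK_FLAG = "BLOCK"   # hypothetical marker value
NODE_FLAG = "NODE"     # hypothetical marker value


def serialize(blocks):
    """blocks: list of blocks; each block is a list of
    (time_offset_cs, [(label, nodal_offset), ...]) tuples, with the
    links of each node already ordered longest first."""
    stream = [BLOCK_FLAG]              # marks the start of block 0
    for block in blocks:
        for time_cs, links in block:
            stream += [NODE_FLAG, time_cs]
            for label, nodal_offset in links:
                stream += [label, nodal_offset]
        stream.append(BLOCK_FLAG)      # end of this block, start of the next
    return stream


# First two nodes of the lattice; the second node's time is hypothetical.
stream = serialize([[(10, [("n", 1), ("m", 1)]), (25, [])]])
print(stream)
```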

Earlier, with reference to FIG. 4 b, a first advantage of the block arrangement of the present lattice data structure was described, namely that it allows the search to jump into the middle of the annotation data for a given audio data stream. For this reason the header, also described with reference to FIG. 4 b, includes a time index which associates the location of the blocks of annotation data within the memory to a given time offset between the time of start and the time corresponding to the beginning of the block. As is described above with respect to FIG. 15 b, the time corresponding to the beginning of a given block is, in the present embodiment, the time of the last node of the block which precedes the given block.
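One simple way of using such a time index to jump to the block covering a given time is a binary search over the block start times. The sketch below is an assumption about how the index might be consulted; the function name `find_block` and the start time of 160 for block 2 are hypothetical, while the start times 0 and 71 follow from the description above.

```python
import bisect


def find_block(block_start_times_cs, query_cs):
    """Return the index of the block whose start time is the latest one
    not exceeding query_cs. block_start_times_cs must be ascending."""
    i = bisect.bisect_right(block_start_times_cs, query_cs) - 1
    return max(i, 0)


starts = [0, 71, 160]           # 160 is a made-up start time for block 2
print(find_block(starts, 94))   # the node at 0.94 s lies in block 1
```

In a stored lattice each start time would be paired with the memory location of the corresponding block, so that the search can begin reading there directly.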

The block arrangement shown in FIG. 15 b displays however further characteristics and advantages, which will now be described. The blocks are determined according to an extent to which word or phoneme links are permitted to extend between blocks. For example, in the present embodiment, the block positions implement a criterion that no link may extend into any block other than its directly neighbouring block. Considering the nodes of block 0, for example, it can be seen from FIG. 15 b that the phoneme links /n/, /m/, /oh/, /w/ and /ih/ and word link “NOW” only extend within the same block in which their source nodes are located, which is allowed by the criterion, and the phoneme link /z/ and the word link “IS” each extend from block 0 into block 1, i.e. into the directly neighbouring block, which is also allowed by the criterion. However, there are no links extending into block 2, because such links would have to extend beyond the directly neighbouring block of block 0 (i.e. block 1) and hence are not allowed by the criterion.

By virtue of the blocks being implemented so as to obey the above described criterion, the following advantages are achieved. If further data is later to be inserted into the phoneme and word lattice structure, this may involve the insertion of one or more additional nodes. In this event, any existing link “passing over” a newly inserted node will require its nodal off-set to be increased by one, as the newly inserted node will need to be included in the count of the number of nodes over which the existing link extends. For example, if a new node were inserted at a time of 0.50 seconds into block 2, then it can be seen from FIG. 15 b that the phoneme link /v/ extending from the node at 0.47 seconds to the node at 0.55 seconds would then acquire a nodal off-set value of 2, rather than its original value of 1, and similarly the word link “OF” extending from the node at 0.34 seconds to the node at 0.55 seconds would have its original nodal off-set value of 2 increased to a nodal off-set of 3. Expressed in terms of the data stream shown in FIG. 16 b, the data component 252 originally showing a value of 001 would need to be changed to a value of 002, and the data component 254 whose original value is 002 would need to have its value changed to 003.

During insertion of such additional nodes and processing of the consequential changes to the nodal off-sets, it is necessary to search back through the lattice data structure from the point of the newly inserted node in order to analyse the earlier existing nodes to determine which of them have links having a nodal off-set sufficiently large to extend beyond the newly inserted node. An advantage of the blocks of the lattice data structure being arranged according to the present criterion is that it reduces the number of earlier existing nodes that need to be analysed. More particularly, it is only necessary to analyse those nodes in the same block in which the node is inserted which precede the inserted node plus the nodes in the neighbouring block directly preceding the block in which the new node has been inserted. For example, if a new node is to be inserted at 0.50 seconds in block 2, it is only necessary to analyse the four existing nodes in block 2 that precede the newly inserted node plus the five nodes of block 1. It is not necessary to search any of the nodes in block 0 in view of the block criterion discussed above.
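The search-back and the consequential increase of nodal off-sets can be sketched as follows, assuming the rule that no link extends beyond a directly neighbouring block. The function name `insert_node`, the dictionary layout of a node and the two-node preceding block (with its links omitted) are hypothetical; the times 34, 47 and 55 and the links “OF” and /v/ follow the example above, reduced to a three-node fragment.

```python
import bisect


def insert_node(blocks, b, new_time_cs):
    """Insert a node into block b at the given off-set time.
    blocks: list of blocks; a node is {"t": int, "links": [[label, off], ...]}.
    Only earlier nodes of block b and all nodes of block b-1 are analysed.
    Returns the number of existing nodes analysed."""
    block = blocks[b]
    p = bisect.bisect_left([n["t"] for n in block], new_time_cs)

    # Pair each candidate node with its distance (in nodes) to the new slot.
    candidates = [(block[i], p - i) for i in range(p)]
    if b > 0:
        prev = blocks[b - 1]
        candidates += [(node, p + len(prev) - i) for i, node in enumerate(prev)]

    for node, dist in candidates:
        for link in node["links"]:
            if link[1] >= dist:      # link passes over the new node
                link[1] += 1

    block.insert(p, {"t": new_time_cs, "links": []})
    return len(candidates)


blocks = [
    [{"t": 10, "links": []}, {"t": 20, "links": []}],   # hypothetical block
    [{"t": 34, "links": [["OF", 2]]},
     {"t": 47, "links": [["v", 1]]},
     {"t": 55, "links": []}],
]
analysed = insert_node(blocks, 1, 50)
print(analysed, blocks[1])
```

After the insertion the link “OF” has a nodal off-set of 3 and the link /v/ a nodal off-set of 2, as in the example above, and no block earlier than the directly preceding one is touched.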

This advantage becomes increasingly beneficial as the length of the lattice increases and the number of blocks formed increases. Furthermore, the advantage not only applies to the insertion of new nodes into an otherwise complete lattice, it also applies to the ongoing procedure of constructing the lattice, which may occur when nodes are not necessarily inserted into a lattice in strict time order.

Yet further, it is noted that the particular choice of the criterion to only allow links to extend into a neighbouring block may be varied. For example, the criterion may allow links extending only as far as four blocks away, it then being necessary to search back only a maximum of four blocks. This still provides a significant advantage in terms of reducing the level of processing required in the case of large lattices, particularly lattices with hundreds or thousands of blocks. The skilled practitioners will appreciate that any appropriate number of blocks can be chosen as the limit in the criterion, it merely being necessary to commensurately adapt the number of blocks that are searched back through.

The lattice data structure of the present embodiment contains a further preferred refinement which is also related to the extension of the word or phoneme links into neighbouring blocks. In particular the lattice data structure further includes data specifying two characteristic points of each block. The two characteristic points for each block are shown as alpha (α) and beta (β) in FIG. 15 b.

Beta for a given block is defined as the time of the latest node in the given block to which any link originating from the previous block extends. Thus, in the case of block 1, beta is at the first node in the block (i.e. the node to which the phoneme link /z/ and the word link “IS” extend), since there are no links originating in block 0 that extend further than the first node of block 1. In the case of block 2, beta is at the third node, since the word link “WINTER” extends to that node from block 1. In the case of the first block of the lattice structure, i.e. block zero, there are intrinsically no links extending into that block. Therefore, beta for this block is defined as occurring before the start of the lattice.

Alpha for a given block is defined as the time of the earliest node in the given block from which a link extends into the next block. In the case of block 0, two links extend into block 1, namely word link “IS” and the phoneme link /z/. Of these, the node from which the word link “IS” extends is earlier in block 0 than the node from which the phoneme link /z/ extends, hence alpha is at the node from which the word link “IS” extends. Similarly, alpha for block 1 is located at the node where the word link “WINTER” originates from. In the case of the last block of the lattice, in this case block 2, there are intrinsically no links extending into any further block, hence alpha is specially defined as being at the last node in the block. Thus it can be appreciated that conceptually beta represents the latest point in a block before which there are nodes which interact with the previous block, and alpha represents the earliest point in a block after which there are nodes which interact with the next block.
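The two definitions can be made concrete with a short sketch that derives alpha and beta for every block, identified by node rather than by time. The function name `alpha_beta`, the representation of a node as a bare list of nodal off-sets and the six-node test lattice are all hypothetical; the sketch assumes a non-empty lattice in which no link extends beyond a directly neighbouring block.

```python
def alpha_beta(blocks):
    """blocks: list of blocks; each node is a list of the nodal off-sets
    of the links leaving it. Returns (alphas, betas) as global node
    indices (0-based). Beta of block 0 is None, i.e. before the start
    of the lattice; alpha of the last block is its last node."""
    starts, total = [], 0
    for block in blocks:
        starts.append(total)
        total += len(block)
    ends = [s + len(b) for s, b in zip(starts, blocks)]   # exclusive

    alphas = [None] * len(blocks)
    betas = [None] * len(blocks)
    for b, block in enumerate(blocks):
        for i, offsets in enumerate(block):
            src = starts[b] + i
            for off in offsets:
                dst = src + off
                if dst >= ends[b] and b + 1 < len(blocks):
                    # Link leaves block b for block b + 1.
                    if alphas[b] is None:
                        alphas[b] = src   # nodes are visited earliest first
                    if betas[b + 1] is None or dst > betas[b + 1]:
                        betas[b + 1] = dst
    alphas[-1] = ends[-1] - 1             # special definition for last block
    return alphas, betas


# Two blocks of three nodes; nodes 1 and 2 both link to node 3,
# mirroring the way "IS" and /z/ both end at the first node of block 1.
lattice = [[[1], [2], [1]], [[1], [1], []]]
print(alpha_beta(lattice))
```

Here alpha of block 0 is node 1, the earlier of the two source nodes, and beta of block 1 is node 3, in line with the definitions above.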

As those skilled in the art will appreciate, each alpha and beta can be specified by identification of a particular node or by specification in terms of time. In the present embodiment identification is specified by nodes. The data specifying alpha and beta within the lattice data structure can be stored in a number of different ways. For example, data components of the type shown in FIG. 16 b can be included containing flags or markers at the relevant locations within the data stream. However, in the present embodiment the points are specified by storing the identities of the respective nodes in a look-up table in the header part of the lattice data structure.

The specification of alpha and beta for each block firstly provides certain advantages with respect to analysing the nodal off-sets of previous nodes in a lattice when a new node is inserted. In particular, when a new node is inserted at a location after beta in a given block, it follows that it is only necessary to analyse the preceding nodes in the given block, and it is no longer necessary to analyse the nodes in the block preceding the given block. This is because it is already known that by virtue of the new inserted node being after beta within the given block, there can by definition be no links that extend from the previous block beyond the newly inserted node, since the position of beta defines the greatest extent which any links extend from the previous block. Thus the need to search and analyse any of the nodes of the preceding block has been avoided, which becomes particularly advantageous as the average size of blocks increases. If alternatively a new node is inserted into a given block at a location before beta of the given block, then it is now necessary to consider links originating from the preceding block as well, but only those nodes at or after alpha in the preceding block. This is due to the fact that from the definition of alpha, it is already known that none of the nodes in the preceding block that come before the preceding block's alpha have links which extend into the given block. Thus processing is again reduced, and the reduction will again become more marked as the size of individual blocks is increased. Moreover, the position of alpha in any given block will tend to be towards the end of that block, so that in the case of long blocks the majority of the processing resource that would otherwise have been used analysing the whole of the preceding block is saved.
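The reduced search range can be expressed compactly. The following is a minimal sketch under the assumption that alpha and beta are held as global node indices (0-based), with beta set to `None` for the first block; the function name `nodes_to_analyse` and the example values are hypothetical.

```python
def nodes_to_analyse(insert_at, block, starts, alphas, betas):
    """Return the range of existing node indices that must be analysed
    when a new node is to take global index insert_at within the given
    block. starts[b] is the index of the first node of block b."""
    beta = betas[block]
    if beta is None or insert_at > beta:
        # After beta: no link from the previous block can pass over.
        return range(starts[block], insert_at)
    # At or before beta: also look back, but only as far as the
    # previous block's alpha.
    return range(alphas[block - 1], insert_at)


starts, alphas, betas = [0, 3], [1, 5], [None, 3]
print(list(nodes_to_analyse(5, 1, starts, alphas, betas)))  # after beta
print(list(nodes_to_analyse(3, 1, starts, alphas, betas)))  # before beta
```

In the first call only the two earlier nodes of the same block are returned; in the second, only the nodes of the preceding block from its alpha onwards.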

The specification of alpha and beta for each block secondly provides certain advantages with respect to employing alpha and beta in procedures to re-define blocks within an existing lattice so as to provide smaller or more evenly arranged blocks whilst maintaining compliance with the earlier mentioned criterion that no link may extend further than one block. In these procedures, existing blocks are essentially split, according to the relative position of alpha and beta within an existing block. In one approach, provided alpha occurs after beta within a given block, the given block can be divided into two blocks by splitting it somewhere between beta and alpha. Similarly, the data specifying beta and alpha is advantageously employed to determine when existing blocks can be split into smaller blocks in the course of a preferred procedure for constructing the lattice data structure.

It was mentioned earlier above that in the present embodiment the longest link from a given node is positioned first in the sequence of data components for any given node as shown in FIG. 16 a. This is advantageous during the procedure of inserting a new node into the lattice data structure, wherein previous nodes must be analysed to determine whether any links originate from them that extend beyond the newly inserted node. By always placing the longest link that extends from any given node at a particular place in the sequence of data components for that node, in the present case at the earliest place within the sequence, if that link is found not to extend over the newly inserted node then it is not necessary to analyse any of the remaining links in the sequence of data components for that node, since they will by definition be of shorter span than the already analysed longest link. Hence further processing economy is achieved.
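The saving can be seen in a small sketch of the per-node check. The function name `bump_links` and the list layout are hypothetical; `dist` denotes the number of nodes from the node being analysed to the slot of the newly inserted node, so that a link passes over the new node when its nodal off-set is at least `dist`.

```python
def bump_links(links, dist):
    """links: [label, nodal_offset] pairs with the longest link first.
    Increases the off-set of every link passing over the new node and
    returns the number of links that had to be examined."""
    if not links:
        return 0
    if links[0][1] < dist:
        return 1   # longest link falls short, so all the others do too
    for link in links:
        if link[1] >= dist:
            link[1] += 1
    return len(links)


links = [["NOW", 3], ["n", 1], ["m", 1]]
print(bump_links(links, 4), links)   # new node lies beyond every link
print(bump_links(links, 2), links)   # only "NOW" passes over the new node
```

Increasing an off-set never disturbs the ordering, since every increased link was already at least as long as every link left unchanged.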

A preferred method of generating the above described lattice data structure will now be described with reference to FIGS. 17 to 19. In this preferred method the constituent data is organised into sets of data components, and the sets of data components are added one at a time to the lattice structure as it is built up. Each set of data components consists of either:

-   (i) two new nodes plus any links directly therebetween (in the case of adding nodes to the lattice which are not to be connected to nodes already in the lattice); or
-   (ii) a new node plus each of the links that end at that node; or
-   (iii) a link between existing nodes within the lattice.

FIG. 17 is a flow diagram which illustrates the process steps employed in the preferred method. In the following explanation of the process steps of FIG. 17, the application of the steps to the construction of the lattice of FIG. 15 b will be demonstrated, and will thus serve to show how the method operates when applied to input data in which the nodes are already fully time sequentially ordered. Thereafter, the way in which the process steps are applied (be it to the construction of a new lattice or to the alteration of an existing lattice) when additional nodes are to be inserted into an existing time ordered sequence of nodes will be described by describing various different additions of data to the lattice data structure of FIG. 15 b.

In overview, as each set of data components is added to the lattice, the various ends of blocks, alphas and betas are updated. When the number of nodes in a block reaches a critical value, in this example 9, the locations of alpha and beta are analysed and if suitable the block is split into two smaller blocks. The various alphas and betas are again updated, and the process then continues in the same manner with the addition of further data components.
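For input in strict time order, the effect of this overview on the block sizes can be simulated with a deliberately simplified sketch. Several assumptions are made here: nodes are only ever appended to the last block, so that alpha is always at its last node; beta of every newly formed block is taken to lie at its first node (`beta_of_new_block=1`), whereas in general it depends on the links; the split position is the midpoint between beta and alpha counted in nodes and rounded up; and the function name `build_block_sizes` is hypothetical.

```python
def build_block_sizes(n_nodes, critical=9, beta_of_new_block=1):
    """Simulate appending n_nodes one at a time and splitting the last
    block whenever it reaches the critical size and alpha > beta.
    Positions are 1-based within the last block; beta = 0 means
    'before the start of the lattice'."""
    sizes = [0]
    beta = 0
    for _ in range(n_nodes):
        sizes[-1] += 1
        alpha = sizes[-1]            # last block: alpha is its last node
        if sizes[-1] >= critical and alpha > beta:
            # Midpoint between beta and alpha, rounded up, but kept
            # before alpha so that alpha falls in the second block.
            split_after = min((alpha + beta + 1) // 2, alpha - 1)
            sizes[-1:] = [split_after, sizes[-1] - split_after]
            beta = beta_of_new_block
    return sizes


print(build_block_sizes(17))
```

With seventeen nodes and a critical value of nine, the simulation yields blocks of five, five and seven nodes, which agrees with the block sizes given above for FIG. 15 b.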

The process steps laid out in FIG. 17 will now be explained in detail. Reference will also be made to FIGS. 18 a to 18 h which show the build up of the lattice structure in the graphical representation form of FIG. 15 b. Additional reference will be made to FIGS. 19 a to 19 h which show the progress of the construction of the data stream defining the lattice, corresponding to the form of FIG. 16 b.

Referring to FIG. 17, at step S61 the automatic speech recognition unit 33 defines the start of the first block, i.e. block zero. In FIG. 18 a the block marker defining the start of the first block is indicated by reference number 202. This is implemented in the data stream by insertion of data component 244 (see FIG. 19 a) consisting of a block flag.

At step S63 the automatic speech recognition unit 33 sets an incremental counter n equal to 1.

At step S65 the automatic speech recognition unit 33 inserts the first set of data components into the data stream defining the lattice data structure. More particularly, the automatic speech recognition unit 33 collects the data corresponding to the first two nodes of the lattice and any direct phoneme links therebetween (in this case phoneme links /n/ and /m/). It then additionally collects any words that have been identified by the word decoder 37 as being associated with a link between these two nodes, although in the case of the first two nodes no such word has been identified. It then inserts the corresponding data components into the data stream. In particular, referring again to FIG. 19 a, data 260 defining the first node of the lattice structure, and being made up of a data component consisting of a node flag and a data component indicating the time of the node, is inserted. Thereafter data 262 comprising the data component consisting of the phoneme link /n/ and the nodal off-set value of 001 is inserted, followed by data 264 comprising a data component consisting of the phoneme /m/ and nodal off-set value 001. Finally, data 266 comprising the data component consisting of a node flag and the data component consisting of the time of that second node is inserted. Thus all of the component parts 260, 262, 264, 266 of the first set of data components are inserted. The first two nodes and the phoneme links /n/ and /m/ therebetween can be seen in FIG. 18 a also.

At step S67 the automatic speech recognition unit 33 determines whether any new nodes have been included in the newly inserted set of data components. The answer in the present case is yes, so the process moves on to step S69 where the automatic speech recognition unit determines whether any of the new nodes are now positioned at the end of the current data lattice structure. The answer in the present case is again yes. In fact, when the method shown in the flow chart of FIG. 17 is used to construct a data lattice from data in which the nodes are ordered in a time sequential manner, as in the present case, the answers to the determination steps S67 and S69 will inherently always be positive. These determination steps are only included in the flow chart to illustrate that the process is capable of accommodating additional nodes or links to be inserted within the lattice when required (examples of these cases will be given later below).

In the present case, the process then moves on to step S71, where the automatic speech recognition unit 33 defines the end of the last block to be immediately after the newly inserted node which is at the end of the lattice. At this stage of the procedure there is only one block, hence in defining the end of the last block, the end of the sole block is in fact defined. This newly defined current end of the block is shown as item 203 in FIG. 18 a, and is implemented in the data stream as data component 245 consisting of a block flag, as shown in FIG. 19 a.

At step S75 the automatic speech recognition unit 33 then determines all of the alpha and beta points. At the present stage there is only one block so only one alpha and one beta is determined. The procedure for determining alpha and beta in the first block was described earlier above. The resulting positions are shown in FIG. 18 a. With respect to the data stream, the alpha and beta positions are entered into the header data, as was described earlier above.

At step S79 the automatic speech recognition unit 33 determines whether any of the alpha and beta values are “invalid”, in the sense of being either indeterminate or positioned such as to contravene the earlier described criterion that no link may extend further than into a directly neighbouring block. At the present stage of building up the lattice this determination step obviously determines that there is no such invalidity, and hence the process moves to step S81. At step S81 the automatic speech recognition unit determines whether the number of nodes in any blocks that have just had nodes inserted in them has reached or exceeded a predetermined critical number. The predetermined critical number is set for the purpose of defining a minimum number of nodes that must be in a block before the block structure will be analysed or altered for the purposes of giving smaller block sizes or more even block spacings. There is an effective overhead cost in terms of resources that are required when carrying out block division, data storage of the block flag data, and so on. Hence block division for blocks containing fewer than the critical number of nodes would tend to be counter productive. The choice of the value of the critical number will depend on the particular characteristics of the lattice or data file being considered. As mentioned above, in the present embodiment the number is set at nine. Hence at the present stage of the process, where only two nodes have been inserted in total, the answer to the determination step S81 is no.

The process steps are thus completed for the first set of data components to be inserted, and the current form of the lattice and data stream is shown in FIGS. 18 a and 19 a.

The procedure then moves to step S89, where the automatic speech recognition unit determines that more sets of data components are to be added, and hence at step S91 increments the value of n by one and the process steps beginning at step S65 are repeated for the next set of data components. In the present case the next set of data components consists of data (item 270 in FIG. 19 b) specifying the third node of the lattice and its time of 0.41 seconds and data (item 268 in FIG. 19 b) specifying the phoneme link /oh/ plus its nodal off-set value of 001. The phoneme link /oh/ and third node are shown having been inserted in FIG. 18 b also. At step S71, the end 203 of the block, being defined as after the last node, is therefore now positioned as shown in FIG. 18 b, and is implemented in the data stream by the data component 245, consisting of a block flag, now being positioned after the newly inserted data 268 and 270. The new position of alpha, now at the new end node, as determined at step S75, is shown in FIG. 18 b. At step S79 it is again determined that there is no invalid alpha or beta, and because the number of nodes is only three (i.e. less than nine) processing of this latest set of data components is now complete, so that the lattice and data stream are currently as shown in FIGS. 18 b and 19 b.

As the procedure continues, the fourth node and the two links which end at that node, namely the phoneme link /w/ and the word link “NOW”, representing the next set of data components, are inserted. The process steps from S65 onwards are followed as described for the previous sets of data components, resulting in the lattice structure shown in FIG. 18 c and the data stream shown in FIG. 19 c. It can be seen in FIG. 19 c that the data 272 corresponding to the phoneme link /w/ and the data 274 corresponding to the latest node are just before the last block flag at the end of the data stream, whereas the data 276 corresponding to the word link “NOW” is placed in the data stream with the node from which that link extends, i.e. the first node. Moreover it is placed before the other links that extend from the first node, namely the phoneme links /n/ and /m/, because their nodal off-set values of 001 are less than the value of 003 for the word link “NOW”.

The procedure continues as described above without variation for the insertion of the fifth, sixth, seventh and eighth nodes providing the lattice structure and data stream shown in FIGS. 18 d and 19 d respectively. On the next cycle of the procedure starting at step S65, the set of data components inserted is the ninth node and the phoneme link /w/ ending at that node. Following implementation in the same manner as above of the steps S67, S69, S71 and S75, the lattice arrangement is as shown in FIG. 18 e-1, with the end 203 of the block located after the newly inserted ninth node, and alpha located at that ninth node. At step S79 the automatic speech recognition unit determines that there is no invalidity of the alpha and beta values and so the process moves on to step S81. The procedure to this point has been the same as for the previous sets of data components. However, since this time the newly inserted node brings the total number of nodes in the sole block up to nine, when the automatic speech recognition unit carries out the determination step S81 it determines for the first time that the number of nodes in the block is indeed greater than or equal to nine. Consequently, this time the procedure moves to step S83, where the automatic speech recognition unit determines whether alpha is greater than beta, i.e. whether alpha occurs later in the block than beta. This is determined in the present example to be the case (in fact this will always be the case for the first block of a lattice due to the way beta is defined for the first block).

It can thus be appreciated that the basic approach of the present method is that when the number of nodes in a block reaches nine or more, the block will be divided into two blocks, provided that alpha is greater than beta. The reason for waiting until a certain number of nodes has been reached is due to the cost in overhead resource, as was explained earlier above. The reason for the criterion that alpha be greater than beta is to ensure that each of the two blocks formed by the division of an original block will obey the earlier described criterion that no link is permitted to extend into any block beyond a directly neighbouring block.

Therefore, in the present case, the procedure moves to step S85 in which the automatic speech recognition unit splits the sole block of FIG. 18 e-1 into two blocks. This is carried out by defining a new end of block 205 which is positioned according to any desired criterion specifying a position somewhere between beta and alpha. In the present embodiment the criterion is to insert the new end of block equally spaced (in terms of the number of nodes, rounded up where necessary) between beta and alpha. Thus, the block is split by insertion of a new end of block 205 immediately after the fifth node, as shown in FIG. 18 e-2. This is implemented in the data stream by the insertion of data component 298, consisting of a block flag, as shown in FIG. 19 e. Additionally, the automatic speech recognition unit 33 recalculates the times of all of the nodes in the newly formed second block as off-sets from the start time of that block, which is the time of the fifth node of the whole lattice (0.71 seconds). Hence the resulting data stream, shown in FIG. 19 e, now contains the newly inserted data component 298, newly inserted data 300 relating to the phoneme link /w/ and newly inserted data 302 relating to the end node. Moreover, the data components 304, 306, 308 and 310 have had their time values changed to new off-set values.
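The splitting step, together with the recalculation of the time off-sets of the newly formed block, can be sketched as follows. The function name `split_block` is hypothetical, positions are counted from 1 within the block with beta = 0 standing for "before the first node", and the node times 25, 56, 110, 125 and 140 are invented for illustration; the times 10, 41, 71 and 94 are those given in the description. The clamp that keeps the split before alpha is a safeguard added in this sketch for the case where alpha directly follows beta, and is not taken from the description.

```python
def split_block(times_cs, alpha, beta):
    """Split one block between beta and alpha.
    times_cs: node off-set times of the block, in hundredths of a second.
    Returns (first, second), with the times of the second block
    re-expressed relative to the last node of the first, or None if
    the block cannot be split."""
    if alpha <= beta:
        return None
    # Midpoint in nodes, rounded up, but never at or beyond alpha.
    split_after = min((alpha + beta + 1) // 2, alpha - 1)
    if split_after < 1:
        return None
    first = times_cs[:split_after]
    new_start = first[-1]            # start time of the new block
    second = [t - new_start for t in times_cs[split_after:]]
    return first, second


# Nine-node sole block: beta before the lattice start, alpha at node 9.
times = [10, 25, 41, 56, 71, 94, 110, 125, 140]
print(split_block(times, alpha=9, beta=0))
```

The block is split immediately after its fifth node, and the four remaining nodes receive new off-sets measured from 0.71 seconds, the first of them becoming 23.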

At step S87 updated values of alpha and beta are determined by the automatic speech recognition unit. Given there are now two blocks, there are two betas and two alphas to be determined. The new locations of these alphas and betas are shown in FIG. 18 e-2.

The procedure of FIG. 17 thereafter continues as described above for the insertion of the tenth through to thirteenth node of the overall lattice without the critical number of 9 nodes yet being reached in block 1. This provides the lattice structure and data stream shown in FIGS. 18 f and 19 f respectively.

The next set of data components inserted consists of the fourteenth node and the phoneme link /oh/ ending at that node. The situation after steps S65 to S79 are implemented for this set of data components is shown in FIG. 18 g-1. Insertion of this latest set of data components has brought the number of nodes in the second block up to nine, and alpha is after beta. Consequently, the automatic speech recognition unit 33 carries out step S85 in which it inserts a new end of block 207 immediately after the fifth node of the block to be split, as shown in FIG. 18 g-2. This is implemented in the data stream by insertion of data component 330 consisting of a new block flag, as shown in FIG. 19 g. The automatic speech recognition unit 33 also calculates the adjusted off-set times (334, 336, 338, 340 in FIG. 19 g) of the nodes in the newly formed third block. Thereafter, at step S87, the automatic speech recognition unit determines updated values of the alphas and betas, which provides a new alpha for what is now the second block and a new beta for what is now the third block, both of which are also shown in FIG. 18 g-2.

The procedure shown in FIG. 17 is repeated for the remaining three sets of data components yet to be added, so providing the lattice structure and data stream shown in FIGS. 18 h and 19 h.

At this stage, the automatic speech recognition unit 33 determines at step S89 that no more sets of data components are available to be inserted, and hence the current lattice data structure is complete, and indeed corresponds to the lattice shown in FIGS. 15 b and 16 b.

An example will now be given to demonstrate the merging of two blocks due to the later insertion of a long link that extends beyond a neighbouring block. This situation did not arise in the earlier example because the data was added into the lattice on a fully time ordered sequential basis. In contrast, in the following example, after the lattice of FIG. 15 b has reached the stage described so far, an additional link is required to be inserted between certain existing nodes. There are a number of reasons why this might occur. One possibility is that the lattice has been completed earlier, then employed as annotation data, but at a later date needs revision. Another possibility is that all the phoneme data is processed first, followed by all the word data, or vice-versa. Yet another possibility is that the data from different soundtracks, e.g. different speakers, is separately added to provide a single lattice.

However, in the present example, the insertion of the earlier timed link is essentially part of the original on-going construction of the lattice, although the data component consisting of the additional link is processed separately at the end because it constitutes a word recognised by the automatic speech recognition unit 33 when passing the phoneme data through a second speech recognition vocabulary. In the present example, the second vocabulary consists of a specialised name place vocabulary that has been optionally selected by a user. Hence, in the present example, at step S89 it is determined that a further set of data components is to be inserted, and following incrementing of the value of n at step S91, the data is inserted at step S65. The data consists of the word link “ESTONIA” and extends from the fourth node of block 0 to the third node of block 2, as shown in FIG. 20 a.

At step S67 the automatic speech recognition unit 33 recognises that no new node has been inserted, hence the process moves to step S75 where it determines updated locations of alpha and beta. However, because the newly inserted link extends from block 0 right over block 1 to end in block 2, it contravenes the earlier described criterion barring link extensions beyond directly neighbouring blocks, and moreover does not produce a valid alpha or beta for block 1. This is represented in FIG. 20 a by the indication that any alpha for block 1 would in fact need to appear in block 0, and any beta for block 1 would need to appear in block 2. Consequently, at the next step S79, it is determined that alpha and beta are indeed invalid.

The procedure therefore moves to step S77 which consists of merging blocks. Any suitable criterion can be used to choose which blocks should be merged together, for example the criterion can be based on providing the most evenly spaced blocks, or could consist of merging the offending block with its preceding block. However, in the present embodiment the choice is always to merge the offending block with its following block, i.e. in the present example block 1 will be merged with block 2.

This is implemented by removal of the block marker dividing block 1 from block 2, resulting in two blocks only, as shown in FIG. 20 b. The procedure then returns to step S75, where the alphas and betas are determined again. The resulting positions of alpha and beta are shown in FIG. 20 b.
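The validity check and merge of steps S75 to S79 can be sketched as below. This is an illustrative sketch under stated assumptions: blocks are represented by a sorted list of the 0-based index of the first node of each block, links are `(source, destination)` index pairs, and the function names are hypothetical. The node counts in the usage note (five nodes in each of blocks 0 and 1, seven in block 2) are inferred from the walkthrough above rather than read from the figures.

```python
from bisect import bisect_right


def block_of(node: int, block_starts: list) -> int:
    """Index of the block containing the given node."""
    return bisect_right(block_starts, node) - 1


def merge_until_valid(block_starts: list, links: list) -> list:
    """Merge blocks until no link extends beyond a directly neighbouring block.

    Following the present embodiment, the offending block (the one a link
    passes right over) is merged with its following block by removing the
    block marker between them.
    """
    starts = list(block_starts)
    while True:
        bad = next(((s, d) for s, d in links
                    if block_of(d, starts) - block_of(s, starts) > 1), None)
        if bad is None:
            return starts
        offending = block_of(bad[0], starts) + 1
        del starts[offending + 1]  # remove marker between offending block and the next
```

With `block_starts = [0, 5, 10]` and the long link `(3, 12)`, corresponding to a link from the fourth node of block 0 to the third node of block 2, the result is `[0, 5]`: block 1 and block 2 have been merged into a single block of twelve nodes.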

At step S79 the automatic speech recognition unit 33 determines that alpha and beta are now valid, so the procedure moves to step S81. In the present example, because there are now twelve nodes in block 1 and because alpha is greater than beta, the procedure moves to step S85 and block 1 is split using the same procedure as described earlier above. However, the earlier employed criterion specifying where to locate the new block division, namely half way in terms of nodes between beta and alpha, contains in the present example a refinement that when the block to be split has greater than nine nodes, splitting should, where possible, leave the earlier of the two resulting blocks with no more than eight nodes. This is to avoid inefficient repetitions of the block splitting process. Hence in the present example the new block marker is inserted immediately after the eighth node of the block being split, as shown in FIG. 20 c. At step S87 the alphas and betas are again determined, the new positions being shown in FIG. 20 c. It is noted that alpha and beta both occur at the same node of block 1. In the present example it is determined at step S89 that no more sets of data components are to be added, and hence the procedure is completed.

In the above procedure described with reference to FIGS. 20 a to 20 c, the changes to the lattice are implemented by changes to the data stream of FIG. 16 b in corresponding fashion to the earlier examples. In particular, step S77 of merging the two blocks is implemented by removal of the relevant data component 248 containing the original block flag dividing the original block 1 and 2.

A further example demonstrating the processing of data according to the procedure laid out in the flow chart of FIG. 17 will now be described with reference to FIGS. 21 a to 21 d. In this example, additional data components are added immediately after the seventeenth node has been added to the lattice of FIG. 15 c. Therefore at step S89 of FIG. 17 further components are indeed to be added and the procedure returns again via increment step S91 to insertion step S65. However, the method steps employed to add the additional data components in the following example also constitute a stand alone method of updating or revising any suitable original lattice irrespective of how the original lattice itself was formed.

In this further example, additional data is added via a keyboard and a phonetic transcription unit, of the same form as the keyboard 3 and phonetic transcription unit 75 shown in FIG. 9. In this further example the output of the phonetic transcription unit is connected to the automatic speech recognition unit 33. The user uses this arrangement to enter annotation data which he intends to correspond to a specific portion of the video data 31-1. Such data is sometimes referred to in the art as “metadata”. The specific portion of the video data may show, for example, a number of profile shots of an actor, which the user wishes to be able to locate/retrieve at a later date as desired by using the annotation data. Hence, he enters the words “PROFILE A B C D E” and moreover specifies that only word links, not phoneme links, should be transcribed. This provides the following data components to be added:

-   (i) a first new node, a second new node, and a word link “PROFILE” therebetween;
-   (ii) a third new node, and the word link “A” between the new second and third nodes;
-   (iii) a fourth new node, and the word link “B” between the new third and fourth nodes;
-   (iv) a fifth new node and the word link “C” between the new fourth and fifth nodes;
-   (v) a sixth new node and the word link “D” between the new fifth and sixth nodes; and
-   (vi) a seventh new node and the word link “E” between the new sixth and seventh nodes.

Referring again to FIG. 17, at step S65 data component (i) as described above is inserted by the automatic speech recognition unit 33 into the lattice of FIG. 15 b, in the position shown in FIG. 21 a. At step S67, the automatic speech recognition unit 33 determines that new nodes have been inserted. At step S69 the automatic speech recognition unit determines that neither of the new nodes has been inserted at either the start or the end of the lattice. In other words, the new nodes have been inserted within an existing lattice, and hence it will probably be necessary to adjust the nodal off-sets of one or more existing nodes of the lattice. The procedure therefore moves to step S73, in which the automatic speech recognition unit 33 carries out such necessary adjustment of the nodal off-sets of existing nodes. Any appropriate method of adjusting the off-sets can be employed at step S73. In the present embodiment a preferred method is employed, and this will be described in detail later below with reference to the flow chart of FIG. 22.

Following adjustment of the off-sets, the procedure of FIG. 17 is followed in the manner described above for the earlier examples, returning to step S65 for insertion of data component (ii). The procedure described above with respect to data component (i) is then repeated for data components (ii) and (iii). FIG. 21 b shows the stage reached when data components (i), (ii) and (iii) have been inserted and the procedure has reached step S81. At this stage, for the first time during this insertion of additional data components, it is now determined that the number of nodes in the second block equals 9. Hence at step S85 the automatic speech recognition unit 33 splits the block and at step S87 determines the new alphas and betas, resulting in the new block structure shown in FIG. 21 c. It is noted that the criterion employed for locating the new block end is one in which the size of the newly formed second block is made as large as possible except that placing the end of the block at alpha itself is not allowed.

The procedure then continues in the same fashion resulting in the insertion of data components (iv), (v) and (vi) up to reaching step S81 during processing of data component (vi). At this stage, the lattice is of the form shown in FIG. 21 d, i.e. nine nodes are now located in present block 2, and hence the outcome of step S81 is that the procedure again moves to step S83. It is noted that the present example has thrown up a situation in present block 2 where beta occurs after alpha, in other words the longest link extending into block 2 extends beyond the start of the earliest link exiting that block 2, as can be seen in FIG. 21 d. If block 2 were to be split in such circumstances, this would inherently involve forming a new block that contravenes the basic criterion of the present embodiment that no link is allowed to extend into any block other than its directly neighbouring block. Because of this, the method of FIG. 17 does not allow splitting of block 2 despite it having nine nodes, and this is implemented by the outcome of determination step S83 being that alpha is not greater than beta, leading to the procedure moving directly on to step S89. In the present example it is determined at step S89 that no more sets of data components are to be added, and hence the procedure ends.

The above-mentioned preferred procedure for implementing step S73 of adjusting the off-sets will now be described with reference to the flow chart of FIG. 22, which shows the procedure followed for each newly inserted node. The preferred method uses the fact that the location of alpha and beta in each block is known. The automatic speech recognition unit 33 analyses nodes preceding the newly inserted node, to determine any links that extend from those nodes beyond the location of the newly inserted node. If any such link is found, then it needs to have its nodal off-set value increased by one, to accommodate the fact that the newly inserted node is present under its span. If the newly inserted node is positioned after beta within a given block, then only those nodes before the newly inserted node and within the same given block need be analysed, since there are inherently no links extending from the previous block beyond beta. Alternatively, if a newly inserted node is positioned before beta in the given block, then the nodes before the newly inserted node in that given block need to be analysed plus the nodes in the preceding block, but only so far back as to include the node corresponding to alpha. The nodes positioned before alpha of the preceding block do not need to be analysed because inherently there are no links extending from before alpha into the block in which the new node has been inserted.

The above procedure is implemented by the process steps shown in FIG. 22. At step S101 the automatic speech recognition unit 33 sets an increment counter to the value i=1. The increment counter is used to control repeated application, as required, of the procedure to consecutive earlier nodes on a node-by-node basis. At step S103 the node which is positioned one place before the inserted node is identified. Referring to FIG. 21 a, in the case of the newly inserted node from which the word link “PROFILE” extends, the identified node one position before it is the node from which the word link “THE” extends. At step S105, all the links extending from the identified node are identified, being here the word link “THE” and the phoneme link /dh/. The automatic speech recognition unit 33 determines the nodal off-set value of these links, which is 002 for the word link “THE” and 001 for the phoneme link /dh/, and hence at step S107 increases each of these nodal off-set values by one, to the new values of 003 and 002 respectively. At step S109 it is determined whether the newly inserted node was positioned before beta. In the present case it was actually positioned after, hence analysis of the nodes need only continue back to the first node of the present block, and hence at step S111 it is determined whether the currently identified node, i.e. the node that has just had its nodal off-sets changed, is the first node of the present block. In the present case it is, and since no further nodes need to have their off-sets adjusted, the procedure ends.
If, however, further nodes remained to be processed in the present block, then the procedure would continue to step S113 where the value of i is incremented, and then the procedure would be repeated for the next previous node starting from step S103. Also, if in the above example the newly inserted node was in fact located before beta, then the procedure would be continued on until each node up to the node corresponding to alpha in the preceding block had been processed. In order to achieve this, when the inserted node is indeed before beta then the procedure moves to step S115 where the automatic speech recognition unit determines whether the identified node is at the position of alpha of the preceding block. If it is then the procedure is complete. If it is not, then the procedure moves to step S117 where the value of i is incremented, and then the procedure is repeated from step S103.
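The off-set adjustment of FIG. 22 can be sketched as a backward walk over the nodes preceding the insertion point. This is a minimal sketch under assumptions: nodes are held in a flat list indexed as they were before the insertion, each entry lists the nodal off-sets of the links leaving that node, and the caller supplies `stop_index`, the earliest node to examine (the first node of the present block if the new node lies after beta, or the alpha node of the preceding block otherwise). The function and parameter names are hypothetical.

```python
def adjust_offsets(links_from: list, new_index: int, stop_index: int) -> list:
    """Increase by one the nodal off-set of every link spanning a new node.

    links_from[j] holds the nodal off-sets of the links leaving node j,
    using the node indices in force before the insertion.
    new_index is the index at which the new node is being inserted.
    """
    adjusted = [list(offsets) for offsets in links_from]
    # Walk back one node at a time (steps S103 to S117).
    for j in range(new_index - 1, stop_index - 1, -1):
        adjusted[j] = [d + 1 if j + d >= new_index else d
                       for d in adjusted[j]]
    return adjusted
```

In the example of FIG. 21 a, the node immediately before the new node carries links with off-sets 2 (“THE”) and 1 (/dh/); both span the new node and so become 3 and 2, while a link ending before the insertion point is left unchanged.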

An alternative way of splitting a block will now be described. When the number of nodes in a given block has reached the critical number and alpha is later than beta for the given block, then the given block and the preceding block are adjusted to form three new blocks in place of those two blocks. This procedure will now be described more fully with reference to FIGS. 23 a and 23 b.

FIG. 23 a shows a sequence of nodes within a lattice, linked by phoneme links, for example phoneme link 412, the end part of a word link 414 and a further word link 416. The nodes are divided into blocks by block markers 402, 404 and 406, forming blocks n and (n+1) of the lattice.

The positions of alpha and beta for block n and block (n+1) respectively are shown also. FIG. 23 a shows the state of the lattice after the data represented by phoneme link 413 and the two nodes between which it extends has been inserted. The number of nodes in block (n+1) has now reached nine, and since also alpha is later than beta, block rearrangement is now implemented. The two blocks of FIG. 23 a are replaced by three blocks, namely block n, block (n+1) and block (n+2), as shown in FIG. 23 b. This is implemented by deleting the block divider 404, and replacing it with two new block dividers 408 and 410 placed immediately after beta of block n and beta of block (n+1) respectively. Alpha and beta for each block are thereafter re-calculated and the new positions are shown in FIG. 23 b. This procedure for rearranging the blocks provides particularly evenly spaced blocks. This is particularly the case when a given block has the required number of nodes for splitting and its alpha is after beta, yet in the block preceding it beta is positioned after alpha. It is noted that this was indeed the case in FIG. 23 a. Because of this, in the preferred embodiment, block splitting is carried out by this procedure of forming a new block between the two beta positions when beta is positioned after alpha in the relevant preceding block, but block splitting follows the originally described procedure of dividing the present block between alpha and beta when beta is positioned before alpha in the preceding block.
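The rearrangement of two blocks into three can be sketched as an operation on the list of block dividers. This is an illustrative sketch only: dividers are represented, by assumption, as the index of the node after which each block ends, and the function name and the index values in the usage note are hypothetical rather than taken from FIG. 23.

```python
def rearrange_into_three(dividers: list, old_divider: int,
                         beta_n: int, beta_n1: int) -> list:
    """Replace one divider by two, placed after the two beta nodes.

    dividers:    sorted node indices after which a block ends.
    old_divider: the divider between block n and block (n+1).
    beta_n:      index of the beta node of block n.
    beta_n1:     index of the beta node of block (n+1).
    """
    if not beta_n < old_divider < beta_n1:
        raise ValueError("betas must lie either side of the old divider")
    kept = [d for d in dividers if d != old_divider]
    return sorted(set(kept + [beta_n, beta_n1]))
```

For instance, `rearrange_into_three([4, 9, 18], 9, 6, 13)` returns `[4, 6, 13, 18]`, so that a new middle block runs from just after the beta of block n to the beta of the old block (n+1).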

In an alternative version of the embodiments described in the preceding paragraph, the two new block dividers may be positioned at nodes relatively close, compared to the number of nodes in each block, to the position of beta of block n and beta of block (n+1) respectively, instead of at those two beta positions as such.

In the above embodiments, the timing of each node of the lattice is provided, prior to arrangement in blocks, relative to a common zero time set such that the first node occurs at a time of 0.10 seconds. The start time for the first block is set equal to the common zero time. The start time for each of the other blocks is the time of the last node of the preceding block. However, in an alternative embodiment the timing of each node may be provided in an absolute form, and the block marker demarcating the start of each block may be given a Universal Standard Time (UST) time stamp, corresponding to the absolute time of the first node of that block rounded down to the nearest whole second. The UST time stamp may be implemented as a 4 byte integer representing a count of the number of seconds since Jan. 1, 1970. The times of the nodes in each block are then determined and stored as offset times relative to the rounded UST time of the start of the block. Because in this embodiment each block time is rounded to the nearest second, if block durations of less than 1 second were to be permitted, then two or more blocks could be allocated the same time stamp value. Therefore, when UST time stamps are employed, block durations less than 1 second are not permitted. This is implemented by specifying a predetermined block duration, e.g. 1 second, that a current block must exceed before splitting of the current block is performed. This requirement will operate in addition to the earlier described requirement that the current block must contain greater than a predetermined number of nodes before splitting of the current block is performed. Alternatively, shorter block durations may be accommodated, by employing a time stamp convention other than UST and then rounding down the block marker times more precisely than the minimum allowed duration of a block.
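The time stamp variant can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the big-endian byte order of the 4 byte integer is an assumption of the sketch, since the description specifies only the width and the epoch.

```python
import math
import struct


def encode_block_times(node_times: list):
    """Return a 4 byte block time stamp and per-node off-sets.

    node_times are absolute times in seconds since Jan. 1, 1970.
    The stamp is the first node's time rounded down to a whole second;
    the byte order (big-endian) is an assumption of this sketch.
    """
    stamp = math.floor(node_times[0])
    offsets = [t - stamp for t in node_times]
    return struct.pack(">I", stamp), offsets


def may_split(block_duration: float, num_nodes: int,
              min_duration: float = 1.0, critical: int = 9) -> bool:
    """Both the duration and the node-count requirements must be met."""
    return block_duration > min_duration and num_nodes >= critical
```

A block whose first node occurs at 1000000000.75 seconds receives the stamp 1000000000, and that node is stored with an off-set of 0.75 seconds. A nine-node block lasting only 0.8 seconds would not be split, which prevents two blocks from receiving the same stamp.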

In the above embodiments the phoneme and word lattice structure was determined and generated by the automatic speech recognition unit 33, configured with the requisite functionality. As will readily be appreciated by those skilled in the art, a standard automatic speech recognition unit can be used instead, in conjunction with a separate lattice creation unit comprising the functionality for determining and generating the above described phoneme and word lattice structure. An embodiment employing a standard automatic speech recognition unit 40, which outputs a sequence of phonemes, is shown in FIG. 24. As was the case for the arrangement shown in earlier FIG. 3, the word decoder 37 identifies words from the phoneme data 35. In the embodiment illustrated in FIG. 24, the identified words are added to the phoneme data to form phoneme and word data 42. This is then passed to a lattice creation unit 44, which determines and generates the above described phoneme and word lattice structure which forms the phoneme and word annotation data 31-3. In other embodiments, which include a standard automatic speech recognition unit which only outputs words, a word to phoneme dictionary can be used to generate phonemes, and then the words and phonemes are combined and formed into the above described phoneme and word lattice structure by a lattice creation unit (not shown).

In the above embodiments, the phoneme and word data was associated with the links of the lattice. As those skilled in the art will appreciate, the word and/or the phoneme data can be associated with the nodes instead. In this case the data associated with each node would preferably include a start and an end time for each word or phoneme associated therewith.

A technique has been described above for organising an unordered list of nodes and links into an ordered and blocked list. The technique has been described for the particular application of the ordering of an unordered list of phonemes and words. However, as those skilled in the art will appreciate, this technique can be applied to other types of data lattices. For example the technique can be applied to a lattice which only has phonemes or a lattice which only has words. Alternatively still, it can be applied to a lattice generated from a hand writing recognition system which produces a lattice of possible characters as a result of a character recognition process. In this case, the nodes and links would not be ordered in time, but would be spatially ordered so that the characters appear in the ordered lattice at a position which corresponds to the character's position on the page relative to the other characters.

1. An apparatus for searching a database, comprising data defining a phoneme and/or word lattice for use in the database, said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query, by a user, the apparatus comprising: means for generating phoneme data corresponding to the user's input query; means for searching the phoneme and word lattice using the phoneme data generated for the input query; and means for outputting search results in dependence upon the output from said searching means.
2. An apparatus according to claim 1, further comprising means for generating word data corresponding to the user's input query and means for searching the phoneme and word lattice using the word data generated for the input query.
3. A method of searching a database, comprising data defining a phoneme and/or word lattice for use in the database, said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query by a user, the method comprising the steps of: generating phoneme data corresponding to the user's input query; searching the phoneme and word lattice using the phoneme data generated for the input query; and outputting search results in dependence upon the results of said searching step.
4. A method according to claim 3, further comprising the steps of generating word data corresponding to the user's input query and searching the phoneme and word lattice using the word data generated for the input query.
5. An apparatus for generating annotation data for use in annotating a data file, the apparatus comprising: a receiver operable to receive phoneme and/or word data; and a first generator operable to generate annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data; wherein the first generator comprises: a second generator operable to generate node data defining a plurality of time-ordered nodes within the lattice; a third generator operable to generate link data defining a plurality of links within the lattice, each link extending from a first node to a second node; a fourth generator operable to generate association data associating each node or link with a phoneme or word from the phoneme and/or word data; and a fifth generator operable to generate block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
6. An apparatus according to claim 5, wherein the block criteria is that links from nodes in any given block do not extend beyond the nodes in the succeeding block.
7. An apparatus according to claim 5, wherein the first generator comprises a processor operable to form the phoneme and/or word lattice by processing the node data for each node and the link data for each link, the processor comprising: i) an adder operable to add one or more nodes and associated link or links to a current block of the lattice until the number of nodes in the current block reaches a predetermined number; ii) a first determiner operable to determine if the current block can be split in accordance with said block criteria; and iii) a splitter operable to split the current block into at least two blocks of nodes.
8. An apparatus according to claim 7, operable to generate the node data and the link data in correspondence to the phoneme and/or word data separately for each phoneme and/or word.
9. An apparatus according to claim 8, operable to generate all the node data and all the link data prior to forming the phoneme and/or word lattice.
10. An apparatus according to claim 8, operable to add the node data and link data for each phoneme and/or word to the phoneme and/or word lattice incrementally as it is generated for each said phoneme and/or word.
11. An apparatus according to claim 10, operable to add the node data and link data incrementally by: determining if a node already exists for the start and end times for the current phoneme or word being processed; adding to the lattice a node or nodes corresponding to the start and/or end time if they do not already exist; and adding a link between the nodes corresponding to the start and end times for the current phoneme or word being processed.
12. An apparatus according to claim 7, further comprising a second determiner operable to determine a first timing or nodal point (β) for each block identifying the latest node in the block to which any link originating in the preceding block extends and a second timing or nodal point (α) for each block identifying the earliest node in the block from which a link extends into the succeeding block; and wherein the first determiner is operable to determine that the current block of nodes can be split in accordance with said block criteria by determining that the first timing or nodal point (β) is before the second timing or nodal point (α) and wherein the splitter is operable to split the current block responsive to the first determiner determining that the current block of nodes can be split.
13. An apparatus according to claim 12, wherein the second determiner is operable to update the first timing or nodal point (β) and the second timing or nodal point (α) for each block, on addition of further nodes to the lattice.
14. An apparatus according to claim 12, wherein the splitter is operable to split the current block between the first timing or nodal point (β) and the second timing or nodal point (α).
15. An apparatus according to claim 14, wherein the sixth generator comprises one of the following: a) a processor operable to receive and process an input voice annotation signal; b) a processor operable to receive and process a text annotation; and c) a processor operable to receive image data representative of a text document and a character recognition unit for converting said image data into text data.
16. An apparatus according to claim 12, wherein the splitter is operable to split the current block by forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block.
17. An apparatus according to claim 12, wherein the splitter is operable to split the current block by forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block if the first timing or nodal point (β) of the preceding block is later than the second timing or nodal point (α) of the preceding block, whereas the splitter is operable to split the current block between the first timing or nodal point (β) and the second timing or nodal point (α) if the first timing or nodal point (β) of the preceding block is earlier than the second timing or nodal point (α) of the preceding block.
 18. Anapparatus according to claim 5, further comprising a sixth generatoroperable to generate the phoneme and/or word data from input audio ortext data.
 19. An apparatus according to claim 18, wherein the data filecomprises audio data, and the sixth generator comprises an automaticspeech recognition system for generating phoneme data for audio data inthe data file.
 20. An apparatus according to claim 19, wherein the sixthgenerator further comprises a word decoder for generating word data byidentifying possible words within the phoneme data generated by theautomatic speech recognition system.
 21. An apparatus according to claim18, wherein the data file comprises text data, and the sixth generatorcomprises a text-to-phoneme converter for generating phoneme data fromtext data in the data file.
 22. An apparatus according to claim 5,wherein said first generator is operable to generate data defining timestamp information for each of said nodes.
 23. An apparatus according to claim 22, wherein said data file includes a time sequential signal, and wherein said first generator is operable to generate time stamp data which is time synchronised with said time sequential signal.
 24. An apparatus according to claim 23, wherein said time sequential signal is an audio and/or video signal.
 25. An apparatus according to claim 5, wherein said first generator is operable to generate data which defines each block's location within the database.
 26. A method of generating annotation data for use in annotating a data file, the method comprising: i) receiving phoneme and/or word data; and ii) generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data; wherein the step of generating annotation data defining the lattice comprises: generating node data defining a plurality of time-ordered nodes within the lattice; generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node; generating association data associating each link or node with a phoneme or word from the phoneme and/or word data; and generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
 27. A method according to claim 26, wherein the block criteria is that links from nodes in any given block do not extend beyond the nodes in the succeeding block.
 28. A method according to claim 26, wherein the step of generating annotation data defining the lattice comprises the following steps for forming the phoneme and/or word lattice by processing the node data for each node and the link data for each link: i) adding one or more nodes and associated link or links to a current block of the lattice until the number of nodes in the current block reaches a predetermined number; ii) determining that the current block can be split in accordance with said block criteria; and iii) splitting the current block into at least two blocks of nodes.
 29. A method according to claim 28, wherein the node data and the link data is generated in correspondence to the phoneme and/or word data separately for each phoneme and/or word.
 30. A method according to claim 29, wherein all the node data and all the link data is generated prior to forming the phoneme and/or word lattice.
 31. A method according to claim 29, wherein the node data and link data for each phoneme and/or word is added to the phoneme and/or word lattice incrementally as it is generated for each said phoneme and/or word.
 32. A method according to claim 31, wherein the node data and link data is added incrementally by: determining if a node already exists for the start and end times for the current phoneme or word being processed; adding to the lattice a node or nodes corresponding to the start and/or end time if they do not already exist; and adding a link between the nodes corresponding to the start and end times for the current phoneme or word being processed.
 33. A method according to claim 28, further comprising determining a first timing or nodal point (β) for each block identifying the latest node in the block to which any link originating in the preceding block extends and a second timing or nodal point (α) for each block identifying the earliest node in the block from which a link extends into the succeeding block; and wherein the step of determining that the current block of nodes can be split in accordance with said block criteria comprises determining that the first timing or nodal point (β) is before the second timing or nodal point (α) and wherein the current block is split into the at least two blocks in response to it being determined that the current block of nodes can be split.
 34. A method according to claim 33, further comprising updating the first timing or nodal point (β) and the second timing or nodal point (α) for each block, on addition of further nodes to the lattice.
 35. A method according to claim 33, wherein the step of splitting the current block comprises splitting the current block between the first timing or nodal point (β) and the second timing or nodal point (α).
 36. A method according to claim 33, wherein the step of splitting the current block comprises forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block.
 37. A method according to claim 33, wherein the step of splitting the current block comprises forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block when the first timing or nodal point (β) of the preceding block is later than the second timing or nodal point (α) of the preceding block, whereas it comprises splitting the current block between the first timing or nodal point (β) and the second timing or nodal point (α) if the first timing or nodal point (β) of the preceding block is earlier than the second timing or nodal point (α) of the preceding block.
 38. A method according to claim 26, further comprising the step of generating the phoneme and/or word data from input audio or text data.
 39. A method according to claim 38, wherein the data file comprises audio data, and the step of generating the phoneme and word data comprises: using an automatic speech recognition system to generate phoneme data for audio data in the data file; and using a word decoder to generate word data by identifying possible words within the phoneme data generated by the automatic speech recognition system.
 40. A method according to claim 38, wherein the data file comprises text data, and the step of generating the phoneme and word data comprises using a text-to-phoneme converter to generate phoneme data from text data in the data file.
 41. A method according to claim 38, wherein the step of generating the phoneme and/or word data comprises one of the following group: a) receiving and processing an input voice annotation signal; b) receiving and processing a text annotation; and c) receiving image data representative of a text document and a character recognition unit for converting said image data into text data.
 42. A method according to claim 26, further comprising generating data defining time stamp information for each of said nodes.
 43. A method according to claim 42, wherein said data file includes a time sequential signal, and wherein the generated time stamp data is time synchronised with said time sequential signal.
 44. A method according to claim 43, wherein said time sequential signal is an audio and/or video signal.
 45. A method according to claim 26, further comprising generating data which defines each block's location within the database.
 46. A method according to claim 26, further comprising forming the phoneme and/or word lattice by processing the node data for each node and the link data for each link by: i) adding node data for two nodes and link data for one or more links therebetween; ii) adding block data to provide an initial block of nodes constituted by the two added nodes; iii) adding to the initial block of nodes further node data and/or link data for one or more further nodes and/or links; iv) repeating (iii) until the number of nodes in the initial block reaches a predetermined number of nodes; v) determining that the initial block of nodes can be split in accordance with said block criteria; vi) adding further block data to split the initial block of nodes into at least two current blocks of nodes; vii) adding to one of the current blocks of nodes further node data and/or link data for one or more further nodes and/or links; viii) repeating (vii) until the number of nodes in any current block is identified as reaching the predetermined number of nodes; ix) determining that the identified current block can be split in accordance with said block criteria; x) adding further block data to split the identified current block into at least two blocks; xi) repeating (viii), (ix) and (x) if required until the node data and link data for all of the nodes and links generated for the phoneme and/or word data has been added to the phoneme and/or word lattice.
 47. An apparatus for generating annotation data for use in annotating a data file, the apparatus comprising: receiving means for receiving phoneme and/or word data; and first generating means for generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data; wherein the first generating means comprises: second generating means for generating node data defining a plurality of time-ordered nodes within the lattice; third generating means for generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node; fourth generating means for generating association data associating each node or link with a phoneme or word from the phoneme and/or word data; and fifth generating means for generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
 48. A computer readable medium storing computer executable instructions for causing a programmable computer device to carry out a method of searching a database, comprising data defining a phoneme and/or word lattice for use in the database, said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query by a user, the instructions comprising: instructions for generating phoneme data corresponding to the user's input query; instructions for searching the phoneme and word lattice using the phoneme data generated for the input query; and instructions for outputting search results in dependence upon the results of said searching step.
 49. Computer executable instructions for causing a programmable computer device to carry out a method of searching a database, comprising data defining a phoneme and/or word lattice for use in the database; said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query by a user, the instructions comprising: instructions for generating phoneme data corresponding to the user's input query; instructions for searching the phoneme and word lattice using the phoneme data generated for the input query; and instructions for outputting search results in dependence upon the results of said searching step.
 50. A computer readable medium storing computer executable instructions for causing a programmable computer device to carry out a method of generating annotation data for use in annotating a data file, the computer executable instructions comprising: instructions for receiving phoneme and/or word data; and instructions for generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data; wherein the instructions for generating annotation data defining the lattice comprise: instructions for generating node data defining a plurality of time-ordered nodes within the lattice; instructions for generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node; instructions for generating association data associating each link or node with a phoneme or word from the phoneme and/or word data; and instructions for generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
 51. Computer executable instructions for causing a programmable computer device to carry out a method of generating annotation data for use in annotating a data file, the computer executable instructions comprising: instructions for receiving phoneme and/or word data; and instructions for generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data; wherein the instructions for generating annotation data defining the lattice comprise: instructions for generating node data defining a plurality of time-ordered nodes within the lattice; instructions for generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node; instructions for generating association data associating each link or node with a phoneme or word from the phoneme and/or word data; and instructions for generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
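
By way of illustration only, the block criteria recited in claims 26 and 27 and the split test recited in claims 33 and 35 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of any described embodiment: nodes are assumed to be numbered 0, 1, 2, ... in time order, each link is a hypothetical `(source, destination, label)` tuple, the blocks are represented by a hypothetical sorted list `starts` holding the first node of each block, β is assumed to default to the first node of a block that receives no link from an earlier block, α is assumed to default to the last node of a block that sends no link into a later block, and the split position midway between β and α is an arbitrary choice. All function names are invented for the example.

```python
from bisect import bisect_right, insort

# A link is (source node, destination node, phoneme or word label).
# Nodes are assumed to be numbered 0..n-1 in time order, so source < destination.
Link = tuple[int, int, str]


def block_of(starts: list[int], node: int) -> int:
    """Return the index of the block containing `node`.

    `starts[k]` is the first node of block k (assumed representation).
    """
    return bisect_right(starts, node) - 1


def meets_block_criterion(starts: list[int], links: list[Link], max_span: int = 1) -> bool:
    """True if no link ends more than `max_span` blocks after the block it starts in."""
    return all(
        block_of(starts, dst) - block_of(starts, src) <= max_span
        for src, dst, _ in links
    )


def beta_alpha(starts: list[int], links: list[Link], k: int, n_nodes: int) -> tuple[int, int]:
    """Return (beta, alpha) for block k.

    beta:  latest node in block k reached by a link that starts before block k.
    alpha: earliest node in block k from which a link leaves block k.
    The defaults (first and last node of the block) are assumptions.
    """
    first = starts[k]
    last = (starts[k + 1] if k + 1 < len(starts) else n_nodes) - 1
    incoming = [dst for src, dst, _ in links if src < first <= dst <= last]
    outgoing = [src for src, dst, _ in links if first <= src <= last < dst]
    return max(incoming, default=first), min(outgoing, default=last)


def try_split(starts: list[int], links: list[Link], k: int, n_nodes: int) -> bool:
    """Split block k between beta and alpha if beta is before alpha.

    Mutates `starts` in place and returns True when a split was made.
    """
    beta, alpha = beta_alpha(starts, links, k, n_nodes)
    if beta >= alpha:
        return False
    # Any new block start s with beta < s <= alpha keeps the criterion intact;
    # the midpoint is chosen here purely for illustration.
    insort(starts, (beta + alpha + 1) // 2)
    return True


# Ten nodes: a chain of phoneme links plus two longer word links.
links: list[Link] = [(i, i + 1, "ph") for i in range(9)] + [(1, 4, "word"), (6, 8, "word")]
starts = [0, 3, 8]  # blocks: nodes 0-2, 3-7 and 8-9
if try_split(starts, links, 1, 10):
    print(starts, meets_block_criterion(starts, links))
```

In this example the middle block has β at node 4 and α at node 6, so it can be split; adding a link from node 3 to node 8 would move α to node 3, which is before β, and the same block could then not be split.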